18

I have a problem with File.list(): on Mac OS X, file names containing non-ASCII characters are returned incorrectly when running on Java 7 from Oracle.

I am using the following example:

import java.io.*;
import java.util.*;

public class ListFiles {

  public static void main(String[] args) 
  {
    try { 
      File folder = new File(".");
      String[] listOfFiles = folder.list(); 
      for (int i = 0; i < listOfFiles.length; i++) 
      {
        System.out.println(listOfFiles[i]);
      }
      Map<String, String> env = System.getenv();
      for (String envName : env.keySet()) {
        System.out.format("%s=%s%n",
            envName,
            env.get(envName));
      }
    } catch (Exception e) { 
      e.printStackTrace(); 
    } 
  }

}

Running this example with Java 6 from Apple, everything is fine:

....
Folder-ÄÖÜäöüß
吃饭.txt
....

Running this example with Java 7 from Oracle, the result is as follows:

....
Folder-A��O��U��a��o��u����
������.txt
....

But, if I set the environment as follows (not set in the two cases above):

LANG=en_US.UTF-8

the result with Java 7 from Oracle is as expected:

....
Folder-ÄÖÜäöüß
吃饭.txt
....

My problem is that I don't want to set the LANG environment variable. It's a GUI application that I want to deploy as a Mac OS X application bundle, and when I do so, the LSEnvironment setting

<key>LSEnvironment</key>
<dict>
  <key>LANG</key>
  <string>en_US.UTF-8</string>
</dict>

in Info.plist has no effect (see also here).

What can I do to retrieve the file names correctly in Java 7 from Oracle on Mac OS X without having to set the LANG environment variable? On Windows and Linux, this problem does not exist.

EDIT:

If I print the individual bytes with:

byte[] x = listOfFiles[i].getBytes();
for (int j = 0; j < x.length; j++) 
{
    System.out.format("%02X",x[j]);
    System.out.print(" ");
}
System.out.println();

the correct results are:

Folder-ÄÖÜäöüß
46 6F 6C 64 65 72 2D 41 CC 88 4F CC 88 55 CC 88 61 CC 88 6F CC 
88 75 CC 88 C3 9F 
吃饭.txt
E5 90 83 E9 A5 AD 2E 74 78 74 

and the wrong results are:

Folder-A��O��U��a��o��u����
46 6F 6C 64 65 72 2D 41 EF BF BD EF BF BD 4F EF BF BD EF BF BD 
55 EF BF BD EF BF BD 61 EF BF BD EF BF BD 6F EF BF BD EF BF BD 
75 EF BF BD EF BF BD EF BF BD EF BF BD  
������.txt
EF BF BD EF BF BD EF BF BD EF BF BD EF BF BD EF BF BD 2E 74 78 74 

So one can see that File.list() replaces some bytes with UTF-8 "EF BF BD" = Unicode U+FFFD = Replacement Character if LANG is not set (only with Java 7 from Oracle).
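The replacement pattern can be reproduced in isolation. The sketch below takes the UTF-8 bytes of 吃饭.txt from the hex dump above and decodes them once as UTF-8 and once as US-ASCII. US-ASCII is only an assumption chosen for illustration (which charset the JVM actually falls back to when LANG is unset is not confirmed here), and the class name DecodeDemo is made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
  // Raw UTF-8 bytes of "吃饭.txt", copied from the hex dump in the question.
  static final byte[] NAME = {
    (byte) 0xE5, (byte) 0x90, (byte) 0x83,
    (byte) 0xE9, (byte) 0xA5, (byte) 0xAD,
    0x2E, 0x74, 0x78, 0x74
  };

  // Decode the raw bytes with the given charset.
  // Bytes the charset cannot decode are replaced with U+FFFD.
  public static String decode(Charset cs) {
    return new String(NAME, cs);
  }

  public static void main(String[] args) {
    System.out.println(decode(StandardCharsets.UTF_8));
    // Assumption: US-ASCII stands in for whatever charset the JVM picked.
    System.out.println(decode(StandardCharsets.US_ASCII));
  }
}
```

Decoded as US-ASCII, each of the six non-ASCII bytes becomes one U+FFFD, which matches the six replacement characters in front of .txt in the wrong output.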

Community
  • Interesting question, +1. Have you checked the [bug database](http://bugs.sun.com/)? – Andrew Thompson Oct 20 '12 at 09:58
  • Yes, and I have found http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4733494 . The conclusion of the bug report: Closed, Not a Defect. I find it interesting that Apple's Java, and Oracle's Java on platforms other than OS X, do not show this behaviour. –  Oct 20 '12 at 10:10
  • I just tested this and I get the opposite problem: Java 6u35 from Apple fails to use the correct encoding, while Java 7u7 from Oracle works. What are your locale settings? Run `locale` in Terminal; I get `CTYPE` set to `UTF-8` and everything else set to `C`. `LANG` and `LC_ALL` are unset. – Joni Oct 20 '12 at 10:21
  • If I run this program within a Terminal, everything is OK in all cases, as LANG is always set to en_US.UTF-8. The problem is when running a Java program as an APP bundle, LANG is **not** set, and LANG can not be set (see end of my original post) - as far as I know. –  Oct 20 '12 at 10:30
  • Why is LANG set in Terminal? Have you modified your `.bashrc` or something like that? – Joni Oct 20 '12 at 11:36
  • This problem has finally been resolved by Oracle in Java 7u40. See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8003228 –  Sep 11 '13 at 15:56

5 Answers

4

If everything else fails, create a wrapper for the JVM that sets the LC_CTYPE environment variable and then launches your application. OS X doesn't care which program the plist tells it to run, does it? It's probably simplest to write this wrapper as a shell script:

#!/bin/bash
export LC_CTYPE="UTF-8" # Try other options if this doesn't work
exec java your.program.Here

The problem is with the way Java - any version of Java, from either Apple or Oracle - reads the names of files from the file system. Names of files on the file system are essentially binary data, and they must be decoded in order to use them as String in Java. (You can read more about this issue in my blog.)

The detection of the encoding varies from platform to platform and version to version, so this must be where Apple Java 6 and Oracle Java 7 differ: Java 6 detects correctly that the system is set to UTF-8, while Java 7 gets it wrong.
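To check what a given JVM detected at startup, the relevant settings can be printed from inside the application bundle. This is a minimal sketch; the class name ShowEncodings is hypothetical, and sun.jnu.encoding is an internal, undocumented property that is assumed here to exist on the Oracle and OpenJDK builds in question (it may be null elsewhere).

```java
import java.nio.charset.Charset;

public class ShowEncodings {
  // Collect the encoding-related settings the JVM picked up at startup.
  public static String describe() {
    StringBuilder sb = new StringBuilder();
    sb.append("file.encoding=").append(System.getProperty("file.encoding")).append('\n');
    // Internal property, assumed to be the charset used for file names; may be null.
    sb.append("sun.jnu.encoding=").append(System.getProperty("sun.jnu.encoding")).append('\n');
    sb.append("defaultCharset=").append(Charset.defaultCharset()).append('\n');
    // Environment variables print as "null" when they are unset.
    sb.append("LANG=").append(System.getenv("LANG")).append('\n');
    sb.append("LC_CTYPE=").append(System.getenv("LC_CTYPE")).append('\n');
    sb.append("LC_ALL=").append(System.getenv("LC_ALL")).append('\n');
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.print(describe());
  }
}
```

Comparing the output from Terminal with the output from the app bundle should show whether the two environments really differ in LANG or LC_CTYPE.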

Strangely though, when I try to reproduce the issue with the following program I find that both Java 6 and Java 7 correctly use UTF-8 to decode file names (they are printed correctly to the terminal). For other I/O, Java 6u35 is using MacRoman as the default charset, while Java 7u7 uses UTF-8 (shown by the file.encoding system property).

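import java.io.*;

public class Test {
  // PrintStream(OutputStream, boolean, String) can throw UnsupportedEncodingException
  public static void main(String[] args) throws UnsupportedEncodingException {
    // Force UTF-8 on stdout so the printed names do not depend on file.encoding
    System.setOut(new PrintStream(System.out, true, "UTF-8"));
    System.out.println(System.getProperty("file.encoding"));
    for (File f : new File(".").listFiles()) {
      System.out.println(f.getName());
    }
  }
}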

When I run locale on OS X 10.7 I get the output below. It seems that on my system Java 6 doesn't correctly interpret the value given for LC_CTYPE. As far as I know the system has no customizations and everything is set to English, so this should be the default configuration:

LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
Joni
  • If you try to reproduce my example, remove all LANG and LC_xxx environment variables (the situation when you start an OS X application bundle). If you run it in a Terminal with LANG=en_US.UTF-8 or LANG=de_DE.UTF-8, my example runs correctly with either Apple's Java or Oracle's Java. –  Oct 20 '12 at 20:22
  • I created an application wrapper that calls a bash script that sets the environment and finally calls the original application that is also included as a resource. It works fine. Thank you. –  Oct 20 '12 at 21:07
2

Since running with Java 6 gives the correct result, would this:

System.out.println(new String(listOfFiles[i].getBytes(),"UTF-8"));

solve the problem?

The suggested constructor explicitly decodes the bytes of listOfFiles[i] as a UTF-8 encoded string.

EDIT:

As that is not working, it means that UTF-8 is not the default encoding for OS X. Wikipedia says that Mac OS Roman is, though. So I'd suggest trying:

System.out.println(new String(listOfFiles[i].getBytes(),"MacRoman"));

but that should be the same as

System.out.println(new String(listOfFiles[i].getBytes()));

So if that does not work either, it leads to the conclusion that it might be a bug, as Andrew Thompson stated in a comment on your question.
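One way to see why re-decoding cannot help: once a name has been decoded with the wrong charset, the original bytes are gone and only U+FFFD is left in the String. A minimal sketch, assuming the damaged string looks like the "A" followed by two replacement characters from the question's hex dump (the class and method names are hypothetical, and UTF-8 is passed explicitly instead of relying on the platform default):

```java
import java.nio.charset.StandardCharsets;

public class LossyRoundTrip {
  // Simulates a name that was already decoded wrongly: the two bytes of the
  // combining diaeresis (CC 88) have each become U+FFFD.
  public static final String DAMAGED = "A\uFFFD\uFFFD";

  // Encode to UTF-8 and decode again, as the suggested fix does.
  public static String redecode(String s) {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    return new String(utf8, StandardCharsets.UTF_8);
  }

  // Hex dump of the UTF-8 bytes, in the same style as the question.
  public static String hex(String s) {
    StringBuilder sb = new StringBuilder();
    for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
      if (sb.length() > 0) sb.append(' ');
      sb.append(String.format("%02X", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(hex(DAMAGED));
    // The round trip returns the same damaged string; CC 88 cannot be recovered.
    System.out.println(redecode(DAMAGED).equals(DAMAGED));
  }
}
```

The hex dump contains EF BF BD where CC 88 used to be, so the fix has to happen before or during File.list(), not afterwards.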

linski
  • I posted it fast but did intend to edit to give explanation and link in the first place :). Do you still think it is a comment candidate? – linski Oct 20 '12 at 10:05
  • `System.out.println(new String(listOfFiles[i].getBytes(),"UTF-8"));` has no effect. The result is `Folder-A��O��U��a��o��u���� ...`. –  Oct 20 '12 at 10:06
  • UTF-8 is not the default encoding for mac os x then. – linski Oct 20 '12 at 10:10
  • @Andrew thanks for the input :) My general criterion for an answer is that it must include at least some research, and I always run the code on my machine when posting it in an answer. I agree that my initial form of the answer was more of a comment, but I never intended to leave it that way in the first place. – linski Oct 20 '12 at 10:19
  • @linski Cool. I decided after reading the newer edit to up-vote. To be absolutely correct though, I speculated about the bug, while the OP found it & posted the link (+1 to them). – Andrew Thompson Oct 20 '12 at 10:22
  • `System.out.println(new String(listOfFiles[i].getBytes(),"MacRoman"));` results in `Folder-AÔøΩÔøΩOÔøΩÔøΩUÔøΩÔøΩaÔøΩÔøΩoÔøΩÔøΩuÔøΩÔøΩÔøΩÔøΩ ...`. `System.out.println(new String(listOfFiles[i].getBytes()));` results in `Folder-A��O��U��a��o��u���� ...`. –  Oct 20 '12 at 10:27
  • thanks for feedback. Since reencoding the string in the same encoding that it is originally encoded has no effect on my machine (it always prints the same string), I suppose that it means two things: MacRoman might not be your default encoding and it looks like a bug. – linski Oct 20 '12 at 10:43
  • Oh, I just saw your comment about running from Terminal versus as an APP bundle. Since I'm not familiar with OS X, it might also mean that it is not necessarily a bug :/ – linski Oct 20 '12 at 10:47
  • also, see [this post](http://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding) about setting the default encoding – linski Oct 20 '12 at 10:48
  • I tried `System.setProperty("file.encoding", "UTF-8");` but it did not change anything. –  Oct 20 '12 at 10:57
0

It's a known bug in OpenJDK. OS X 10.6 and OS X 10.7 return different values for the default locale. See bug http://java.net/jira/browse/MACOSX_PORT-204 and http://java.net/jira/browse/MACOSX_PORT-165. If you're having this problem, vote for getting it fixed.

0

Downgrade your JDK to the built-in Mac OS X JDK. If you do, the problem should vanish.

In addition, you may also want to set your run configuration in Eclipse to run as UTF-8.

0

It's a bug in the old java.io.File API (maybe just on a Mac). Anyway, it's all fixed in the new java.nio.

I have several files containing unicode characters in the filename and content that failed to load using java.io.File and related classes. After converting all my code to use java.nio.Path EVERYTHING started working. And I replaced org.apache.commons.io.FileUtils (which has the same problem) with java.nio.Files...

...and be sure to read and write the content of file using an appropriate charset, for example: Files.readAllLines(myPath, StandardCharsets.UTF_8)
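For reference, a directory listing with the NIO API might look like the sketch below. It is a minimal example under stated assumptions: the class name ListFilesNio and the helper listNames are made up, and whether this path avoids the decoding problem on an affected JVM has not been verified here beyond the experience described above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ListFilesNio {
  // Return the names of all entries in dir, using java.nio.file instead of File.list().
  public static List<String> listNames(Path dir) throws IOException {
    List<String> names = new ArrayList<String>();
    // try-with-resources closes the directory stream when the loop is done
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
      for (Path entry : stream) {
        names.add(entry.getFileName().toString());
      }
    }
    return names;
  }

  public static void main(String[] args) throws IOException {
    for (String name : listNames(Paths.get("."))) {
      System.out.println(name);
    }
  }
}
```

Printing the names still goes through System.out, so the console encoding matters independently of how the names were read.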

pomo
  • This problem has been resolved by Oracle in Java 7u40. See bugs.sun.com/bugdatabase/view_bug.do?bug_id=8003228 –  Feb 25 '14 at 21:04