I'm having trouble understanding the way the IBM JVM's implementation of java.io.File deals with UTF-8 on AIX on the JFS2 filesystem. I suspect there's a system property that I'm overlooking, but I have not yet been able to find it.

Let's assume I have a file named othér (where é is U+00E9, or UTF-8 bytes 0xc3 0xa9). The filename is encoded in UTF-8, and was created by a C program:

#include <fcntl.h>
/* 0xc3 0xa9 is the UTF-8 encoding of U+00E9 */
char filename[] = { 'o', 't', 'h', (char)0xc3, (char)0xa9, 'r', 0 };
open(filename, O_RDWR|O_CREAT, 0666);

If I create a Unicode string in Java that is representative of the filename, it fails to open it. Further, if I use File.listFiles() in Java, it insists on treating this as a Latin1 string. For example:

import java.io.File;

public class FileTest
{
    public static void main(String[] args)
    {
        // "othér" built from Unicode code points (é is U+00E9)
        String expectedName = new String(new char[] { 'o', 't', 'h', 0xe9, 'r' });
        File expected = new File(expectedName);
        if (expected.exists())
            System.out.println(expectedName + " exists");
        else
            System.out.println(expectedName + " DOES NOT exist");

        // Dump every name in the current directory as UTF-16 code units
        for (File child : new File(".").listFiles())
        {
            System.out.println(child.getName());
            System.out.print("Chars:");
            for (char c : child.getName().toCharArray())
                System.out.print(" 0x" + Integer.toHexString((int)c));
            System.out.println();
        }
    }
}

The results of this program are:

% java -Dfile.encoding=UTF8 FileTest
othér DOES NOT exist
othér
Chars: 0x6f 0x74 0x68 0xc3 0xa9 0x72

So it appears that my filenames are getting treated as Latin1. I've tried setting the file.encoding system property to UTF8 and the client.encoding.override system property to UTF-8 to no avail. My LANG and LC_ALL settings are en_US.UTF-8:

% echo $LANG
en_US.UTF-8
% echo $LC_ALL
en_US.UTF-8
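
For reference, the `Chars:` line in the program output above is exactly what decoding the on-disk UTF-8 bytes as ISO-8859-1 produces. The following is a minimal sketch of that effect in isolation; the class name `MojibakeDemo` and its helper methods are made up for this illustration and do not come from the original program. It also shows that the misread name can be turned back into the intended one by re-encoding as ISO-8859-1 and decoding as UTF-8, which may serve as a stopgap when the JVM's filename encoding cannot be changed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class MojibakeDemo {
    // Decode raw filename bytes using the named charset.
    static String decode(byte[] raw, String charsetName) {
        return Charset.forName(charsetName).decode(ByteBuffer.wrap(raw)).toString();
    }

    // Re-encode a string with one charset and decode the bytes with another.
    static String reinterpret(String s, String from, String to) {
        ByteBuffer bytes = Charset.forName(from).encode(s);
        return Charset.forName(to).decode(bytes).toString();
    }

    // Format each UTF-16 code unit the same way the test program does.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray())
            sb.append(" 0x").append(Integer.toHexString((int) c));
        return sb.toString();
    }

    public static void main(String[] args) {
        // The bytes the C program wrote as the filename.
        byte[] raw = { 'o', 't', 'h', (byte) 0xc3, (byte) 0xa9, 'r' };

        String misread = decode(raw, "ISO-8859-1"); // what listFiles() appears to return
        String intended = decode(raw, "UTF-8");     // what the filename should be

        System.out.println("ISO-8859-1:" + hex(misread));
        System.out.println("UTF-8:     " + hex(intended));

        // Undo the misreading: Latin1 chars -> original bytes -> UTF-8 string.
        String repaired = reinterpret(misread, "ISO-8859-1", "UTF-8");
        System.out.println("repaired:  " + hex(repaired));
    }
}
```

Decoded as ISO-8859-1, the six bytes become six chars (including 0xc3 and 0xa9); decoded as UTF-8, they become the five chars of the expected name with a single 0xe9.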

My system's "Primary Language Environment", as configured by SMIT, is "ISO8859-1". I don't really know the full impact this setting has, but I cannot change it. I suspect that if I could change this to "UTF8 English" then that may fix the problem, but since JFS2 stores filenames in Unicode and Java operates in Unicode internally, I feel like there should be a more general solution to the problem.

Is there another system property for J9 that I can set that will force it to use UTF-8 filenames regardless of my SMIT setting?

AIX version is 5.2, Java version is IBM J9 (1.5.0), filesystem is JFS2:

rs6000% uname -a
AIX rs6000 2 5 000A9B7C4C00
rs6000% java -version
java version "1.5.0"
Java(TM) 2 Runtime Environment, Standard Edition (build pap32dev-20091106a (SR11 ))
IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 AIX ppc-32 j9vmap3223-20091104 (JIT enabled)
J9VM - 20091103_45935_bHdSMr
JIT  - 20091016_1845_r8
GC   - 20091026_AA)
JCL  - 20091106
rs6000% mount|grep /home
         /dev/hd1         /home            jfs2   Jun 27 16:02 rw,log=/dev/hd8 

Update: this still occurs on Java 6:

% java -version
java version "1.6.0"
Java(TM) SE Runtime Environment (build pap3260sr11-20120806_01(SR11))
IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 AIX ppc-32 jvmap3260sr11-20120801_118201 (JIT enabled, AOT enabled)
J9VM - 20120801_118201
JIT  - r9_20120608_24176ifx1
GC   - 20120516_AA)
JCL  - 20120713_01
Edward Thomson
  • Does `java` on AIX pick up the encoding from the locale, like it does on other Unixes? Try running the test program as `LANG=en_US.UTF-8 java FileTest` – Joni Oct 20 '12 at 15:36
  • It does not. I forgot to include that above. *However*, it's possible that UTF-8 is not valid or not installed and my `LANG` and `LC_ALL` settings are being ignored, but my lack of familiarity with SMIT makes this difficult to determine fully. – Edward Thomson Oct 20 '12 at 15:51
  • Check this question (http://stackoverflow.com/questions/1545625/java-cant-open-a-file-with-surrogate-unicode-values-in-the-filename). AFAIK there are problems in Java with opening files whose filename encoding differs from the system locale. – Konstantin V. Salikhov Oct 22 '12 at 05:18
  • Just to be sure - when you run `ls` - the output is `othér`, right? – RonK Oct 23 '12 at 20:53

2 Answers

I found the answer. I really am trying to help here.

This is a blog post about your actual issue. I promise.

Try running your program with the -Dsun.jnu.encoding=UTF-8 flag set.
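
To see whether the flag takes effect on a given JVM, it may help to print the encoding-related properties alongside the existence check. The sketch below is only an illustration: the class name `EncodingReport` and the `prop` helper are invented here, and `sun.jnu.encoding` is an internal, undocumented property, so its presence and behaviour on any particular JVM is an assumption to verify rather than a guarantee.

```java
import java.io.File;
import java.nio.charset.Charset;

public class EncodingReport {
    // Returns the property's value, or "(unset)" when the JVM does not define it.
    static String prop(String key) {
        String value = System.getProperty(key);
        return value == null ? "(unset)" : value;
    }

    public static void main(String[] args) {
        // file.encoding governs default byte<->char conversion of file contents.
        System.out.println("file.encoding    = " + prop("file.encoding"));
        // sun.jnu.encoding is internal; it may be unset on some JVMs.
        System.out.println("sun.jnu.encoding = " + prop("sun.jnu.encoding"));
        System.out.println("default charset  = " + Charset.defaultCharset());

        // Same filename as in the question: "oth", U+00E9, "r".
        String expectedName = "oth\u00e9r";
        boolean found = new File(expectedName).exists();
        System.out.println(expectedName + (found ? " exists" : " DOES NOT exist"));
    }
}
```

Running it once as `java EncodingReport` and once as `java -Dsun.jnu.encoding=UTF-8 EncodingReport` in the directory containing the file should show whether the property changes the result.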

durron597
  • Strings are not latin1 in Java, they are sequences of UTF-16 code-units. The two expressions you show are equivalent, with the exception that the second won't compile until you put in `(byte)` casts, and the first won't compile until you add a single `(char)` cast. – Mike Samuel Oct 24 '12 at 17:14
  • They're not Latin1. They're Unicode code points. http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/String.html#String(char[]) – Edward Thomson Oct 24 '12 at 17:18
  • @durron597: I think you're mistaken, initializing a `String` with a `char[]` of Unicode code points is correct. – Edward Thomson Oct 24 '12 at 17:37
  • Creating a string that *should* represent the filename is not the problem. Creating a string that *accurately* represents the filename *is* the problem. The previous article you link to discusses how the Mac OS filesystem stores filenames as canonically decomposed always. This is not the issue here. – Edward Thomson Oct 24 '12 at 18:10
  • I'll be damned. It really does. I need to look into this more, but yes, I think that the `sun.jnu.encoding` system property does indeed affect the behavior. – Edward Thomson Oct 24 '12 at 18:39
  • @DeepakBala: I was equally dubious. But the IBM JVM does indeed read that system property. – Edward Thomson Oct 24 '12 at 18:40
  • That is interesting. I can't find any reliable documentation on this property yet. I guess this is another one of those software issues that we can file under 'It works, but I don't know why' :) – Deepak Bala Oct 24 '12 at 18:46
  • @durron597: just need to do a bit more testing - but do have an upvote in the meantime. – Edward Thomson Oct 27 '12 at 21:18
See http://www.ibm.com/developerworks/java/jdk/aix/118/README.html for a list of valid AIX locales. Your exports should look like this, I think:

  export LC_ALL=EN_US
  export LANG=EN_US
user18428
  • My reading of that would indicate that en_US is ISO-8859-1 (aka "Latin 1") not UTF-8. – Edward Thomson Oct 28 '12 at 19:05
  • The case seems to matter. EN_US is listed as UTF8, while en_US is listed as ISO8859_1. – user18428 Oct 28 '12 at 20:18
  • What does the output of the command "locale -a" give you? You should see en_US and its alias en_US.8859_1, as well as EN_US and its alias EN_US.UTF-8. It seems stupid that two different cases refer to two different encodings, but it seems to be so. – user18428 Oct 29 '12 at 14:07
  • Gah, apparently `UTF-8` is not even installed. I hadn't even thought to look at `locale -a`, thank you. Suddenly things make more sense. Unfortunately, I'm not the administrator. And thank you for pointing out the casing in that document, I hadn't noticed that there was indeed a difference between `en_US` and `EN_US`. – Edward Thomson Oct 30 '12 at 13:23
  • Can you confirm that with your locale as `en_US.UTF-8` that the JVM acts sanely? – Edward Thomson Oct 30 '12 at 13:24
  • Unfortunately I haven't had an AIX box at hand for quite some time now, so I can't confirm, but I really think that will do it. – user18428 Oct 30 '12 at 13:32
  • Cool. I appreciate the help - I agree with you that my lack of the UTF-8 language pack is the likely culprit here, but I also can't confirm that, unfortunately. – Edward Thomson Oct 31 '12 at 01:26
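
The `locale -a` check suggested in the comments above can also be scripted from Java, which is handy when the JVM is launched from a wrapper and it is unclear which `LANG` it actually sees. This is a rough sketch under stated assumptions: the class name `LocaleCheck` and its methods are hypothetical, it assumes a `locale` command is on the `PATH`, and a name appearing in the `locale -a` output is treated as "installed" without any further validation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class LocaleCheck {
    // Exact, case-sensitive match: per the comments, EN_US and en_US
    // are different locales on AIX.
    static boolean isInstalled(String wanted, List<String> installed) {
        return wanted != null && installed.contains(wanted);
    }

    // Runs `locale -a` and collects one locale name per output line.
    static List<String> installedLocales() throws IOException {
        Process p = new ProcessBuilder("locale", "-a").redirectErrorStream(true).start();
        BufferedReader in = new BufferedReader(new InputStreamReader(p.getInputStream()));
        List<String> names = new ArrayList<String>();
        try {
            String line;
            while ((line = in.readLine()) != null)
                names.add(line.trim());
        } finally {
            in.close();
        }
        return names;
    }

    public static void main(String[] args) {
        // LANG as seen by this JVM process, which may differ from the login shell.
        String lang = System.getenv("LANG");
        try {
            List<String> installed = installedLocales();
            System.out.println("LANG=" + lang
                + (isInstalled(lang, installed) ? " is installed" : " is NOT installed"));
        } catch (IOException e) {
            System.out.println("could not run locale -a: " + e.getMessage());
        }
    }
}
```

If the requested locale is reported as not installed, the system may be silently falling back to another locale, which would be consistent with the Latin1 behaviour described in the question.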