5

When I try to open a file whose name contains accented characters, Java reports that the file cannot be found, apparently because of a charset mismatch. I work in UTF-8 on a Linux system (/etc/locales sets UTF-8 as well). JBoss runs with -Dfile.encoding=UTF-8 and the environment variable JBOSS_ENCODING="UTF-8".

In a JSP I read the name of the file:

String fileName = element.getChildText("FileName");
out.println("File to be opened : " + fileName);

Displays :

File to be opened : aaaaaà.txt

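However, new File(fileName) does not work: file.exists() returns false.

When I list the directory instead:

File[] files = dir.listFiles();
for (int i = 0; i < files.length; i++) {
    // print each name as Java decoded it from the file system
    out.println(files[i].getName());
}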

I get : aaaaaà .txt

Why does Java treat the name of the file on disk as ISO-8859-1 when it reads and opens the file? Is it a JBoss setting? A Java setting? How can I force java.io.File to interpret file names as UTF-8?

I've used other tools and the name is always read fine, using UTF-8.

(Note that I am always talking about the name of the file, never its content; the file could be empty.)
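To see what an ISO-8859-1 misreading of a UTF-8 name looks like, the effect can be reproduced in memory. The sketch below is not a fix and touches no files; the class name `MojibakeDemo` and the method `misread` are made up for illustration, and it uses `StandardCharsets`, which needs Java 7 or later. It encodes the name as UTF-8 and then decodes the same bytes as ISO-8859-1, which is roughly what happens if the charset Java uses for file names disagrees with the one used on disk.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode a name as UTF-8, then (wrongly) decode those bytes as ISO-8859-1.
    static String misread(String name) {
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = misread("aaaaa\u00E0.txt"); // \u00E0 is 'à'
        // Print code points so the result does not depend on the console charset.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

The single character 'à' turns into two characters, U+00C3 followed by U+00A0 (a no-break space), so the misread name is one character longer than the real one and will not match it.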

Llistes Sugra
  • `-Dfile.encoding=UTF-8` is Sun/Oracle JVM specific. What JVM are you using? Even then, you should after all not be using this argument at all. – BalusC Sep 30 '10 at 16:33
  • @BalusC: I'm not sure what you mean by that. The "-Dfile.encoding" tag is also supported by at least the IBM JVM (I'm not sure how many other JVM's are in serious use today). – Steve Perkins Sep 30 '10 at 17:15
  • JVM is Java Hotspot, anyway, so it fits with the comment – Llistes Sugra Sep 30 '10 at 18:12
  • I tried the same on Linux and also failed. Java couldn't get the file names properly although I tried all combinations of `LANG`, `LC_ALL`, `file.encoding` and `sun.jnu.encoding`, without success. Any more ideas? – Roland Illig Sep 30 '10 at 21:28
  • No more ideas. It seems the way is to go through the charsets and try each one. – Llistes Sugra Oct 07 '10 at 14:38

2 Answers

3

I am trying to track down the problem. Here is what I already have:

There is Exists.java:

import java.io.*;

public class Exists {
  public static void main(String[] args) {
    new File("aaa").exists();
    new File("aaa\u00E4").exists();
    new File("aaa\u00C3\u00A4").exists();
  }
}

And there is java -version:

java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

Now to the interesting part:

$ strace -f -o strace.out java Exists && grep 'stat("aaa' strace.out
31942 stat("aaa", 0x41464950)           = -1 ENOENT (No such file or directory)
31942 stat("aaa\303\244", 0x41464950)   = -1 ENOENT (No such file or directory)
31942 stat("aaa\303\203\302\244", 0x41464950) = -1 ENOENT (No such file or directory)

The nice thing is that strace works at the byte level, not at the character level like Java. So everything is OK in this case: each string reaches the kernel as its UTF-8 byte sequence. I have the environment variable LANG set to en_US.UTF-8, and all of the LC_* variables are unset.
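The octal escapes in the strace output can be cross-checked from Java itself. This is a small sketch, with the hypothetical names `StraceBytes` and `asStrace`, that renders a string's UTF-8 bytes roughly the way strace prints them (it ignores strace's special cases such as `\n` or quotes). It uses `StandardCharsets`, which needs Java 7 or later; on the Java 6 shown above, `Charset.forName("UTF-8")` would be the equivalent.

```java
import java.nio.charset.StandardCharsets;

class StraceBytes {
    // Printable ASCII is kept as-is; every other byte becomes a 3-digit octal escape.
    static String asStrace(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v >= 0x20 && v < 0x7F) {
                sb.append((char) v);
            } else {
                sb.append('\\').append(String.format("%03o", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(asStrace("aaa"));
        System.out.println(asStrace("aaa\u00E4"));
        System.out.println(asStrace("aaa\u00C3\u00A4"));
    }
}
```

The second and third strings come out as `aaa\303\244` and `aaa\303\203\302\244`, the same byte sequences that appear in the `stat` calls above.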

Now narrowing the problem down to a minimal working example:

$ strace -f -o strace.out env - LC_ALL=en_US.UTF-8 /home/roland/bin/java Exists && grep 'stat("aaa' strace.out
31968 stat("aaa", 0x41a75950)           = -1 ENOENT (No such file or directory)
31968 stat("aaa\303\244", 0x41a75950)   = -1 ENOENT (No such file or directory)
31968 stat("aaa\303\203\302\244", 0x41a75950) = -1 ENOENT (No such file or directory)

That still works. So let's try another encoding:

$ strace -f -o strace.out env - LANG=en_US.ISO-8859-1 /home/roland/bin/java Exists && grep 'stat("aaa' strace.out
32070 stat("aaa", 0x407a3950)           = -1 ENOENT (No such file or directory)
32070 stat("aaa?", 0x407a3950)          = -1 ENOENT (No such file or directory)
32070 stat("aaa??", 0x407a3950)         = -1 ENOENT (No such file or directory)

So this doesn't work. One possible reason might be that I selected a locale that is not in the list printed by locale -a. But this shouldn't be the reason for Java to convert the letters to question marks.
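One possible explanation for the question marks, stated here as an assumption and not verified on this JVM: if the requested locale cannot be loaded, the process may fall back to the C/POSIX locale, whose charset is plain ASCII. Java's `String.getBytes(Charset)` replaces every character the target charset cannot represent with that charset's replacement byte, which for US-ASCII is `?`. The sketch below (hypothetical names `QuestionMarks` and `viaAscii`) shows that substitution in isolation, without touching the file system.

```java
import java.nio.charset.StandardCharsets;

class QuestionMarks {
    // Round-trip a string through US-ASCII; unmappable characters become '?'.
    static String viaAscii(String s) {
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(viaAscii("aaa"));
        System.out.println(viaAscii("aaa\u00E4"));
        System.out.println(viaAscii("aaa\u00C3\u00A4"));
    }
}
```

The three results are `aaa`, `aaa?` and `aaa??`, which is the same pattern as the `stat` arguments in the failing run.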

As soon as LANG points to a non-existing locale, the setting of the sun.jnu.encoding property doesn't have any effect anymore. So I'm out of ideas now.
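To check which charsets a given JVM actually picked up from its environment, a short diagnostic like the following can be run under the same `env` settings as the tests above. The class name `EncodingInfo` and the method `describe` are made up for this sketch. Note that `sun.jnu.encoding` is an internal property of Sun/Oracle-derived JVMs, so it may be absent elsewhere, in which case it prints as `null`.

```java
import java.nio.charset.Charset;

class EncodingInfo {
    // Collect the encoding-related settings the running JVM reports.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding") + "\n"
             + "sun.jnu.encoding=" + System.getProperty("sun.jnu.encoding") + "\n"
             + "defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Comparing the output for an existing and a non-existing `LANG` value shows whether the JVM silently fell back to a different charset.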

Roland Illig
  • Question mark is supposed to be displayed when trying to encode an ISO with UTF-8. It seems you are doing the opposite, so it should write something like "÷". I guess this is a console issue consisting in writing in UTF (again) something strace converted to ISO. – Llistes Sugra Oct 01 '10 at 10:52
  • No, it isn't. Why should the UTF-8 bytes be displayed as octal escapes and the latin1 ones not? As I said, `strace` works on byte-level. Otherwise it would be useless for binary data. – Roland Illig Oct 01 '10 at 23:33
  • `"aaa\u00C3\u00A4"` does not mean what you think it means. It represents five characters, not five bytes. The filename is only four characters long. `"aaa\u00E4"` is correct. – Christoffer Hammarström Apr 13 '12 at 12:58
  • I chose the `"aaa\u00C3\u00A4"` example deliberately, and I know that it represents the string `aaaä`. I chose it so that it might have been translated to `aaaä` in the test case where I had set `LC_ALL=en_US.ISO-8859-1`. – Roland Illig Apr 13 '12 at 22:43
1

Try this:

Java Can't Open a File with Surrogate Unicode Values in the Filename?

Steve Perkins