12

I'm dealing with code that does various IO operations with files, and I want to make it able to deal with international filenames. I'm working on a Mac with Java 1.5, and if a filename contains Unicode characters that require surrogates, the JVM can't seem to locate the file. For example, my test file is:

"草鷗外.gif" which gets broken into the Java characters \u8349\uD85B\uDFF6\u9DD7\u5916.gif

If I create a file from this filename, I can't open it because I get a FileNotFound exception. Even using this on the folder containing the file will fail:

File[] files = folder.listFiles(); 
for (File file : files) {
    if (!file.exists()) {
        System.out.println("Failed to find File"); //Fails on the surrogate filename
    }
}

Most of the code I am actually dealing with is of the form:

FileInputStream instream = new FileInputStream(new File("草鷗外.gif"));
// operations follow

Is there some way I can address this problem, either escaping the filenames or opening files differently?

Bear
  • What's the value of Charset.defaultCharset() in your environment? – matt b Oct 09 '09 at 19:24
  • (Unfortunately, StackOverflow also has a problem with surrogates, and has stripped the U+26FF6 ideograph from the question) – bobince Oct 09 '09 at 19:42
  • Can you provide what System.getProperty("file.encoding") returns? Try changing your encoding with java -Dfile.encoding=ENCODING_GOES_HERE; if that does not work, change your system locale. If that does not work either, we will wait for an expert to solve it. – JCasso Oct 09 '09 at 20:14
  • The charset and file encoding are both UTF-8 – Bear Oct 27 '09 at 00:00
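For anyone who wants to gather the values asked for in the comments above, here is a minimal sketch. The class name EncodingReport and the helper describe are made up for illustration; the filename literal is the one from the question, written with escapes so the supplementary character survives copy and paste.

```java
import java.nio.charset.Charset;

class EncodingReport {
    // Formats each Unicode code point of s as U+XXXX.
    // A valid surrogate pair is reported as one supplementary code point.
    static String describe(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The two settings the commenters asked about.
        System.out.println("defaultCharset = " + Charset.defaultCharset());
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));

        // Filename from the question, including the U+26FF6 surrogate pair.
        String name = "\u8349\uD85B\uDFF6\u9DD7\u5916.gif";
        System.out.println("code points    = " + describe(name));
    }
}
```

If the code point dump shows U+26FF6 as a single entry, the Java string itself is fine and the problem lies in how the name is converted to bytes for the filesystem.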

4 Answers

7

I suspect one of Java or Mac is using CESU-8 instead of proper UTF-8. Java uses “modified UTF-8” (which is a slight variation of CESU-8) for a variety of internal purposes, but I wasn't aware it could use it as a filesystem/defaultCharset. Unfortunately I have neither Mac nor Java here to test with.

“Modified” is a modified way of saying “badly bugged”. Instead of outputting a four-byte UTF-8 sequence for supplementary (non-BMP) characters like 𦿶:

\xF0\xA6\xBF\xB6

it outputs a UTF-8-encoded sequence for each of the surrogates:

\xED\xA1\x9B\xED\xBF\xB6

This isn't a valid UTF-8 sequence, but a lot of decoders will allow it anyway. Problem is if you round-trip that through a real UTF-8 encoder you've got a different string, the four-byte one above. Try to access the file with that name and boom! fail.
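The two byte sequences above can be reproduced from Java itself. This is only a sketch: it uses DataOutputStream.writeUTF, which is documented to write modified UTF-8, as a stand-in for whatever conversion the JVM applies to filenames (that part is an assumption, not something verified here). The class and method names are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SurrogateEncodingDemo {
    // Renders bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Standard UTF-8: one four-byte sequence per supplementary character.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by DataOutputStream.writeUTF:
    // each surrogate is encoded separately as three bytes.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            byte[] all = buf.toByteArray();
            // writeUTF prefixes the data with a two-byte length; drop it.
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String ideograph = "\uD85B\uDFF6"; // U+26FF6 as a surrogate pair
        System.out.println("UTF-8:          " + hex(standardUtf8(ideograph)));
        System.out.println("modified UTF-8: " + hex(modifiedUtf8(ideograph)));
    }
}
```

The first line should show the four-byte form and the second the six-byte form, which makes it easy to see that the two encodings name different byte strings for the same Java String.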

So first let's just check how filenames are actually stored under your current filesystem, using a platform that uses bytes for filenames such as Python 2.x:

$ python
Python 2.x.something (blah blah)
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')

On my filesystem (Linux, ext4, UTF-8), the filename “草𦿶鷗外.gif” comes out as:

['\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']

which is what you want. If that's what you get, it's probably Java doing it wrong. If you get the longer six-byte-character version:

['\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']

it's probably OS X doing it wrong... does it always store filenames like this? (Or did the files come from somewhere else originally?) What if you rename the file to the ‘proper’ version?:

os.rename('\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif', '\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif')
bobince
  • Not really a bug as it's part of the spec (even if it is often confusing.) – finnw Oct 10 '09 at 00:20
  • The result of the python commands was the proper filename you listed first, so it must be Java not playing nice. – Bear Oct 26 '09 at 18:19
  • Oh, that's unfortunate. Even if you detected the broken-CESU-8 situation, I can't think of any way to work around it and get a byte-oriented filename interface. :-( You might have to explicitly disallow the surrogates until such time as Sun fix it. How poor. – bobince Oct 26 '09 at 18:37
5

If your environment's default locale does not include those characters you cannot open the file.

See: File.exists() fails with unicode characters in name

Edit: Alright. What you need to do is change the system locale, whatever OS you are using.

Edit:

See: How can I open files containing accents in Java?

See: JFileChooser on Mac cannot see files named by Chinese chars?

JCasso
  • Is it not possible to do this without changing the system locale? The program I am building will need to run on any locale, and I should be able to input these characters and deal with these files even in a US/English locale. – Bear Oct 26 '09 at 18:22
  • Bad solution, because the app runs for users who are not sitting at my computer. They have different locales, and they do not have administrator rights to change them. – Dmitry Nelepov Jun 08 '13 at 16:01
  • AFAIK there is no other solution. This limitation comes with Sun/Oracle Java. You can try JFileChooser if displaying a save dialog to your users is OK for you. – JCasso Jun 10 '13 at 06:59
3

This turned out to be a problem with the Mac JVM (tested on 1.5 and 1.6). Filenames containing supplementary characters / surrogate pairs cannot be accessed with the Java File class. I ended up writing a JNI library with Carbon calls for the Mac version of the project (ick). I suspect the CESU-8 issue bobince mentioned, as the JNI call to get UTF-8 characters returned a CESU-8 string. Doesn't look like it's something you can really get around.

Bear
0

It's a bug in the old-skool Java File API, maybe just on a Mac? Anyway, the new java.nio API works much better. I have several files containing Unicode characters and content that failed to load using java.io.File and related classes. After converting all my code to use java.nio.file.Path, EVERYTHING started working. And I replaced org.apache.commons.io.FileUtils (which has the same problem) with java.nio.file.Files...

...and be sure to read and write the file's content using an appropriate charset, for example: Files.readAllLines(myPath, StandardCharsets.UTF_8)
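A rough sketch of what the java.nio.file version of the question's two snippets might look like. It needs Java 7 or later, so it does not apply to the 1.5 JVM in the question, and whether it actually avoids the surrogate problem on a given Mac JVM is not verified here. NioFileAccess, countBytes and listMissing are names invented for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class NioFileAccess {
    // Opens the file through java.nio.file.Files instead of FileInputStream
    // and returns how many bytes could be read from it.
    static long countBytes(Path path) {
        try (InputStream in = Files.newInputStream(path)) {
            long total = 0;
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // NIO counterpart of the listFiles()/exists() loop in the question:
    // returns every directory entry that Files.exists() cannot find.
    static List<Path> listMissing(Path folder) {
        List<Path> missing = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(folder)) {
            for (Path entry : entries) {
                if (!Files.exists(entry)) {
                    missing.add(entry);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return missing;
    }

    public static void main(String[] args) {
        // Folder to scan; defaults to the current directory.
        Path folder = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path entry : listMissing(folder)) {
            System.out.println("Failed to find " + entry);
        }
    }
}
```

Running it against the folder that holds the problem file shows whether the NIO code path can see entries that java.io.File could not.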

pomo