
This problem was dropped on me, and I have spent a couple of days on unsuccessful searches and workaround attempts.

I maintain an internal Java Swing program, distributed via JNLP/Web Start to OS X and Windows computers, that, among other things, downloads files from WebDAV.

Recently, on a test machine with OS X 10.8 and Java 7, accented characters in file and directory names started being replaced by question marks.

There is no problem on OS X with Java versions before 7.

Example:

XXXYYY_è_ABCD/

becomes

XXXYYY_?_ABCD/

Using java.text.Normalizer (NFD, NFC, NFKD, NFKC) on the original string gives a different but still wrong result:

XXXYYY_e?_ABCD/

or

XXXYYY_e_ABCD/

I know, from correspondence between [andrew.brygin at oracle.com] and [mik3hall at gmail.com], that:

Yes, file.encoding is set based on the locale that the jvm is running on, and if you run your java vm in xxxx.UTF-8 locale, the file.encoding should be UTF-8, set to MacRoman will be problematic. So I believe Oracle/OpenJDK7 behaves correctly. That said, as Andrew Thompson pointed out, if all previous Apple JDK releases use MacRoman as the file.encoding for english/UTF-8 locale, there is a "compatibility" concern here, it might worth putting something in the release note to give Oracle/OpenJDK MacOS user a heads up.

original mail

From Joni Salonen's blog (java-and-file-names-with-invalid-characters) I know that:

You probably know that Java uses a “default character encoding” to convert binary data to Strings. To read or write text using another encoding you can use an InputStreamReader or OutputStreamWriter. But for data-to-text conversions deep in the API you have no choice but to change the default encoding.

and

What about file.encoding?

The file.encoding system property can also be used to set the default character encoding that Java uses for I/O. Unfortunately it seems to have no effect on how file names are decoded into Strings.

Executing locale from inside the JNLP application invariably prints:

LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
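
The same picture can be checked from inside the JVM. Below is a minimal diagnostic sketch, assuming an Oracle/OpenJDK runtime: `sun.jnu.encoding` is an internal, undocumented property (commonly reported as the one involved in decoding file names) and may be missing on other JVMs. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects the encoding-related settings the JVM actually sees.
class EncodingDiagnostics {

    static Map<String, String> snapshot() {
        Map<String, String> info = new LinkedHashMap<String, String>();
        // Default charset for file *content*
        info.put("file.encoding", String.valueOf(System.getProperty("file.encoding")));
        // Internal property (assumption: present on Oracle/OpenJDK only)
        info.put("sun.jnu.encoding", String.valueOf(System.getProperty("sun.jnu.encoding")));
        info.put("defaultCharset", Charset.defaultCharset().name());
        // Environment variables are read-only from Java; unset ones show up as "null"
        info.put("LANG", String.valueOf(System.getenv("LANG")));
        info.put("LC_CTYPE", String.valueOf(System.getenv("LC_CTYPE")));
        info.put("LC_ALL", String.valueOf(System.getenv("LC_ALL")));
        return info;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : snapshot().entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

Comparing the output when the program is started from a shell and when it is started through Web Start should show which of these values differ between the two launch paths.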

The most similar problem on Stack Overflow with a solution is this one: encoding-issues-on-java-7-file-names-in-os-x

but the solution is to wrap the execution of the Java program in a script:

#!/bin/bash
export LC_CTYPE="UTF-8" # Try other options if this doesn't work
exec java your.program.Here

I don't think this option is available to me because of Web Start, and I haven't found any way to set the LC_CTYPE environment variable from within the program.

Any solutions or workarounds?

P.S. :

If we run the program directly from a shell, it writes the file/directory correctly even on OS X 10.8 + Java 7. The problem appears only with the combination of JNLP + OS X + Java 7.

Duralumin
  • It has been suggested to me to use the jnlp properties to set the system properties, in the same way it has been done here : http://stackoverflow.com/questions/5887351/java-applet-via-jnlp-system-properties-not-being-set but I'm under the impression (excuse my general ignorance on jnlp related matters) that those properties aren't going to influence the environment variables like LC_CTYPE. Is that right? – Duralumin Nov 29 '12 at 11:29
  • Do you have any methods in your code that use the default charset (see [this list for example](http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/tools/forbiddenApis/jdk.txt?revision=1412598&view=markup))? – assylias Dec 05 '12 at 18:43
  • I did a search, there are some (like toLowerCase) that are used around the code, but not around the problematic functionality. Why? – Duralumin Dec 06 '12 at 09:02
  • Did you find a solution to this problem? – Fotis Paraskevopoulos Mar 12 '13 at 17:33
  • @Fotis not yet. Our system administrator sent a bug request to Oracle. I think. Hope. Still waiting. – Duralumin Mar 13 '13 at 08:37

5 Answers


I take it that it's acceptable to use a maximal ASCII representation of the file name, which works in virtually any encoding.

First, you want to use NFKD specifically, so that maximum information is retained in the ASCII form. For example, "2⁵" becomes "25" rather than just "2", and "ﬁ" becomes "fi" rather than "", once the non-ASCII and control characters are filtered out.

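import java.text.Normalizer;

String str = "XXXYYY_è_ABCD/";
// Decompose: è becomes e followed by a combining grave accent (U+0300)
str = Normalizer.normalize(str, Normalizer.Form.NFKD);
// Drop everything outside printable ASCII, including the combining accent
str = str.replaceAll("[^\\x20-\\x7E]", "");
// str is now "XXXYYY_e_ABCD/" regardless of the system encoding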

You would then always pass file names through this filter to get their file system name. All you lose is some uniqueness, i.e. asdé.txt becomes the same as asde.txt, and in this system they cannot be differentiated.
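
To apply the filter consistently, it can be wrapped in a single method that every file name passes through. This is only a sketch of that idea: the class and method names are hypothetical, and the non-ASCII inputs are written as Unicode escapes so the source compiles regardless of its own encoding.

```java
import java.text.Normalizer;

// Hypothetical helper: reduces any file name to its printable-ASCII form.
class AsciiFileNames {

    static String toAsciiFileName(String name) {
        // NFKD splits accented letters into base letter + combining mark and
        // expands compatibility characters (superscripts, ligatures)
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFKD);
        // Keep only printable ASCII (0x20 to 0x7E); combining marks are dropped here
        return decomposed.replaceAll("[^\\x20-\\x7E]", "");
    }

    public static void main(String[] args) {
        // e with grave accent (U+00E8)
        System.out.println(toAsciiFileName("XXXYYY_\u00e8_ABCD/")); // prints XXXYYY_e_ABCD/
        // superscript five (U+2075)
        System.out.println(toAsciiFileName("2\u2075"));             // prints 25
        // fi ligature (U+FB01)
        System.out.println(toAsciiFileName("\uFB01"));              // prints fi
        // The loss of uniqueness: both names map to asde.txt
        System.out.println(toAsciiFileName("asd\u00e9.txt")
                .equals(toAsciiFileName("asde.txt")));              // prints true
    }
}
```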

Esailija
  • Unfortunately, it's not. I too proposed stripping or replacing accented and special characters from the file and directory names; in the end they are there only to provide a readable reference on the unique code (the code is written as part of the directory structure), and they are prone to problems (as in this case). But no, it's not acceptable. – Duralumin Dec 10 '12 at 13:32
  • @Duralumin ok sorry then, I read too much into your normalization attempt. As in since you tried to do this I thought it would be ok. – Esailija Dec 10 '12 at 13:34
  • Not your fault, thank you anyway. If only the solution had been something as sensible as that... =) – Duralumin Dec 10 '12 at 13:38
  • @Duralumin what about using uri encoding for the filenames and abstracting the fact away? – Esailija Dec 10 '12 at 17:22
  • Can't do. Not my choice. =/ – Duralumin Dec 11 '12 at 08:26

EDIT: After experimenting with OS X some more I realized my answer was totally wrong, so I'm redoing it.

If your JVM supports -Dfile.encoding=UTF-8 on the JVM command line, that might fix the issue. I believe that is a standard property but I'm not certain about that.

HFS Plus, like other POSIX-compliant file systems, stores filenames as bytes. But unlike Linux's ext3 filesystem, it forces filenames to be valid decomposed UTF-8. This can be seen here with the Python interpreter on my OS X system, starting in an empty directory.

$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53) 
>>> import os
>>> os.mkdir('\xc3\xa8')
>>> os.mkdir('e\xcc\x80')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 17] File exists: 'e\xcc\x80'
>>> os.mkdir('\x8f')
>>> os.listdir('.')
['%8F', 'e\xcc\x80']
>>> ^D
$ ls
%8F è

This proves that the directory name on your filesystem cannot be Mac-Roman encoded (i.e. with byte value 8F where the è is seen), as long as it's an HFS Plus filesystem. But of course, the JVM is not assured of an HFS Plus filesystem, and SMB and NFS do not have the same encoding guarantees, so the JVM should not assume this scheme.

Therefore, you have to convince the JVM to interpret file and directory names with UTF-8 encoding, in order to read the names as java.lang.String objects correctly.
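
The byte sequences from the Python session can be reproduced in Java, which may help when comparing what the program sends with what ends up on disk. This is a small sketch with hypothetical class and method names. It only shows the UTF-8 bytes of the composed and decomposed forms and says nothing about what the JVM does internally when it creates the file.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper: shows the UTF-8 bytes of a string in a given normalization form.
class Utf8Forms {

    static String utf8Hex(String s, Normalizer.Form form) {
        byte[] bytes = Normalizer.normalize(s, form).getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            // Mask to avoid sign extension of negative byte values
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String eGrave = "\u00e8"; // e with grave accent
        // Composed form: one code point, two bytes
        System.out.println(utf8Hex(eGrave, Normalizer.Form.NFC)); // prints c3a8
        // Decomposed form: 'e' followed by combining grave accent
        System.out.println(utf8Hex(eGrave, Normalizer.Form.NFD)); // prints 65cc80
    }
}
```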

wberry
  • Forcing -Dfile.encoding was about the first thing we tried, because it usually was the solution for this type of problems. But not in this case. It's confirmed in Joni Salonen blog: "The file.encoding system property can also be used to set the default character encoding that Java uses for I/O. Unfortunately it seems to have no effect on how file names are decoded into Strings." – Duralumin Dec 10 '12 at 09:13
  • OK, so far I have only done what you have done. It's tempting to go to the JVM source at this point rather than mess with `LC_` variables. – wberry Dec 10 '12 at 21:48
  • Yep, it's one or the other, in the end. I'm already trying to push a bug report, but going through the official channels here for the request, and then waiting for a fix from Oracle is going to take quite a lot of time. Was hoping for a workaround of some kind. – Duralumin Dec 11 '12 at 08:30

Shot in the dark: file encoding does not influence how file names are created, only how the content gets written into the file. See this post: http://jonisalonen.com/2012/java-and-file-names-with-invalid-characters/

Here is a short entry from Apple: http://developer.apple.com/library/mac/#qa/qa1173/_index.html

Comparing this to http://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html I would assume you want to use

normalized_string = Normalizer.normalize(target_chars, Normalizer.Form.NFD);

to normalize the file names before you pass them to the File constructor. Does this help?

Stefan
  • Please read the whole question. In it I cited two extracts from that blog post by Joni Salonen, and I said we had already used java.text.Normalizer. – Duralumin Dec 10 '12 at 09:58
  • Please read all of my answer :) There are different normalizing strategies (NFD, NFC, NFKD, NFKC), which are dealt with differently by the Apple OS. My suggestion was to experiment with the different normalization mechanisms. – Stefan Dec 10 '12 at 10:12
  • You're right, I'm sorry. I didn't expressly specify that, but with Normalizer we tried all four normalizing strategies. From the beginning I didn't think that would make a difference, because, as Joni Salonen said, the only thing that influences the filenames is the locale of the filesystem. – Duralumin Dec 10 '12 at 10:24
  • Bummer, I was not sure either this would work. What is the source of your file name? Is this stored in the .java file, or are you reading it from somewhere? These guys had some similar issues: http://stackoverflow.com/questions/3610013/file-listfiles-mangles-unicode-names-with-jdk-6-unicode-normalization-issues – Stefan Dec 10 '12 at 10:37
  • I hadn't seen that question before. Yeah, the problem seems similar, but their problem seems to be with reading, while they seem to write the filename correctly (we can write the filename correctly in Java 6 too). The filename is read/processed from a database. – Duralumin Dec 10 '12 at 11:00
  • If you want to I can keep on "guessing" and "poking", but to be honest I am not sure how I can help. If you have some reproducer code I can run it on my Macs. I have some Snow Leopards, Lions and Mountain Lions at home running on various Regions (mostly American Language and German Region) I could run your reproducer on. – Stefan Dec 10 '12 at 11:04
  • The Code is simply a File file = new File(path);file.mkdirs(). What's before doesn't seem to make any difference. The point is you must come from jnlp with java 7 on osx. We have various test machines, and this combination is the only one giving problems. – Duralumin Dec 10 '12 at 11:35
  • duh its been a while, since I last setup a JNLP, if I have time I'll give it a shot tonight. – Stefan Dec 10 '12 at 12:02
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/20878/discussion-between-duralumin-and-stefan) – Duralumin Dec 10 '12 at 13:07

I don't think there is a real solution to this problem right now.

In the meantime I came to the conclusion that the "C" locale values printed from inside the program come from the Java Web Start sandbox, and (by design, apparently) you can't influence them from the JNLP file.

The workaround/compromise accepted by the company was to launch the JNLP using javaws from a bash script.

Apparently, launching the JNLP from a browser or from Finder creates a new sandbox environment with LANG unset (so it falls back to "C", which is equivalent to ASCII). Launching the JNLP from the command line instead yields the right LANG, inherited from the shell's system default.

This at least preserves the auto-update feature of the JNLP and its dependencies.

Anyway, we sent a bug report to Oracle, but personally I don't expect it to be resolved anytime soon, if ever.
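
For completeness, the same idea can be expressed as a small Java bootstrap instead of a bash script. This is an untested sketch under several assumptions: `javaws` is on the PATH, the child process honors the locale variables it is given, and `en_US.UTF-8` is an acceptable locale name on the target machine. The class and method names are hypothetical.

```java
import java.util.Map;

// Hypothetical bootstrap: starts javaws with an explicit UTF-8 locale in its environment.
class JavawsLauncher {

    static ProcessBuilder buildLauncher(String jnlpLocation, String locale) {
        ProcessBuilder pb = new ProcessBuilder("javaws", jnlpLocation);
        // The environment map of a ProcessBuilder is modifiable, unlike System.getenv()
        Map<String, String> env = pb.environment();
        env.put("LANG", locale);
        env.put("LC_CTYPE", locale);
        // Forward the child's stdout/stderr to this process
        pb.inheritIO();
        return pb;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java JavawsLauncher <jnlp file or URL>");
            return;
        }
        // Assumption: en_US.UTF-8 exists on the machine; adjust as needed
        Process p = buildLauncher(args[0], "en_US.UTF-8").start();
        System.exit(p.waitFor());
    }
}
```

Like the bash wrapper, this keeps the application itself under Web Start, so updates would still be handled by the JNLP mechanism.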

Duralumin

It's a bug in the old-skool Java File API, maybe just on a Mac? Anyway, the new java.nio API works much better. I have several files containing Unicode characters and content that failed to load using java.io.File and related classes. After converting all my code to use java.nio.file.Path, EVERYTHING started working. And I replaced org.apache.commons.io.FileUtils (which has the same problem) with java.nio.file.Files...

...and be sure to read and write the content of file using an appropriate charset, for example:

Files.readAllLines(myPath, StandardCharsets.UTF_8)
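
A slightly fuller sketch of that approach, creating a directory and round-tripping UTF-8 content through java.nio.file only. The class name, method name and file name are hypothetical, and whether this avoids the Web Start problem described in the question is not something the sketch can confirm.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: creates a directory and round-trips text using java.nio.file only.
class NioUtf8RoundTrip {

    static List<String> writeAndRead(Path parent, String dirName, List<String> lines)
            throws IOException {
        // Creates the directory (and any missing parents); no java.io.File involved
        Path dir = Files.createDirectories(parent.resolve(dirName));
        Path file = dir.resolve("content.txt");
        // Always name the charset explicitly instead of relying on the default
        Files.write(file, lines, StandardCharsets.UTF_8);
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("nio-demo");
        // Directory name with U+00E8; may throw InvalidPathException if the JVM's
        // file name encoding cannot represent the character
        List<String> back = writeAndRead(base, "XXXYYY_\u00e8_ABCD",
                Arrays.asList("caf\u00e9"));
        System.out.println(back);
    }
}
```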
pomo