java reads file system file names differently on osx and linux

Question

I have a Java program that is almost working perfectly. I'm developing on a Mac and pushing to Linux for production. When the Mac searches the file system and inserts new file names into the database, it works great. However, when I push to the Linux box and do the same search/insert, it treats file names containing certain characters as different, e.g. Béla Fleck. The names look identical to me in the database and on both the Mac and Linux file systems. In fact, the Mac and Linux boxes both have NFS mounts to a third system (Linux) where the files reside.

I've dumped the bytes and can see how linux and mac are seeing the string from the file system: Béla Fleck.

linux:

utf8bytes[0] = 0x42
utf8bytes[1] = 0x65
utf8bytes[2] = 0xcc
utf8bytes[3] = 0x81
utf8bytes[4] = 0x6c
utf8bytes[5] = 0x61
utf8bytes[6] = 0x20
utf8bytes[7] = 0x46
utf8bytes[8] = 0x6c
utf8bytes[9] = 0x65
utf8bytes[10] = 0x63
utf8bytes[11] = 0x6b

linux says LANG=en_US.UTF-8

mac:

utf8Bytes[0] = 0x42
utf8Bytes[1] = 0xc3
utf8Bytes[2] = 0xa9
utf8Bytes[3] = 0x6c
utf8Bytes[4] = 0x61
utf8Bytes[5] = 0x20
utf8Bytes[6] = 0x46
utf8Bytes[7] = 0x6c
utf8Bytes[8] = 0x65
utf8Bytes[9] = 0x63
utf8Bytes[10] = 0x6b

mac says LANG=en_US.UTF-8

I tried this, still no joy:

java -Dfile.encoding=UTF-8

I'm using java nio file to get the directory:

java.nio.file.Path path = Paths.get("test");

then walking the path with

Files.walkFileTree(path, new SimpleFileVisitor<Path>() {

and then, since this is a subdir in the test path:

 file.getParent().getName(1).toString()

Anyone have any ideas on what is glitching here and how I can fix this?

Thanks.

They look identical because the Mac file name contains a single accented ‘e’ character (`é`), whereas the Linux file name contains a plain ‘e’ character followed by a combining accent (`e ˊ`). Visually, they look identical, and most [Collators](http://docs.oracle.com/javase/8/docs/api/java/text/Collator.html) would consider them identical. As for why they’re different, it’s hard to tell without seeing the code that obtains/creates the file names. — VGR, Dec 02 '16 at 15:13

score 3 · Accepted Answer · edited May 23 '17 at 12:02


Some searching revealed that OS X always decomposes file names:

This suggests to me that you may have accidentally switched the outputs: the first byte array is decomposed, so I’m guessing it was taken from a Mac, whereas the second one is from Linux.

In any event, if you want them to be identical for all systems, you can do the decomposition yourself:

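import java.text.Normalizer;
// Decompose the name (NFD) so it has the same form no matter which OS produced it.
String name = file.getParent().getName(1).toString();
name = Normalizer.normalize(name, Normalizer.Form.NFD);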
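
For context, here is a minimal sketch of how that normalization could sit inside the file walker from the question. The class name `NormalizedWalk` and the method `parentNames` are made up for illustration, and it uses the immediate parent directory name rather than `getName(1)`, since the real directory layout isn't shown.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

class NormalizedWalk {

    // Hypothetical helper: walks the tree under root and returns, for every
    // regular file visited, the NFD-normalized name of its parent directory.
    static List<String> parentNames(Path root) throws IOException {
        List<String> names = new ArrayList<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                Path parent = file.getParent();
                if (parent != null && parent.getFileName() != null) {
                    String raw = parent.getFileName().toString();
                    // Normalize before the name goes anywhere near the database.
                    names.add(Normalizer.normalize(raw, Normalizer.Form.NFD));
                }
                return FileVisitResult.CONTINUE;
            }
        });
        return names;
    }

    public static void main(String[] args) throws IOException {
        // "test" is the directory used in the question; pass another path as an argument.
        Path root = Paths.get(args.length > 0 ? args[0] : "test");
        for (String name : parentNames(root)) {
            System.out.println(name);
        }
    }
}
```

Whichever form you pick, the important part is to apply the same one to every name before inserting it or comparing it against what is already stored.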

edited May 23 '17 at 12:02

Community


answered Dec 02 '16 at 16:10

VGR


Time for me to learn what that Normalizer of which you speak is doing. It worked as you stated. – phomlish Dec 02 '16 at 16:58

The [documentation for Normalizer](http://docs.oracle.com/javase/8/docs/api/java/text/Normalizer.html) contains a link to the Unicode specification which lays out the concept of normalization. – VGR Dec 02 '16 at 18:09
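
As a rough illustration of the two normalization forms involved here (NFC composes, NFD decomposes), this sketch runs both on the name from the question and prints the resulting UTF-8 bytes. The class name `NormalizationForms` and the `hex` helper are hypothetical; only `java.text.Normalizer` is the real API.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class NormalizationForms {

    // Hypothetical helper: UTF-8 bytes of s as a lowercase hex string.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String composed = "B\u00e9la Fleck";    // single code point U+00E9
        String decomposed = "Be\u0301la Fleck"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT

        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        String nfd = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println(composed.equals(decomposed)); // false: different code points
        System.out.println(nfc.equals(composed));        // true
        System.out.println(nfd.equals(decomposed));      // true
        System.out.println(hex(nfc));                    // 42c3a96c6120466c65636b
        System.out.println(hex(nfd));                    // 4265cc816c6120466c65636b
    }
}
```

The two hex strings correspond to the 11-byte and 12-byte dumps in the question.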

score 1 · Answer 2 · answered Dec 02 '16 at 17:28

(Not really an answer, just more discussion.)

Those seem to be utf8 characters, but formed in different ways.

c3a9 is é -- This is normally how one would enter an accented letter.

However, it is possible to use a pair of characters:

65cc91 is ȇ, but formed as a combination of e and a "COMBINING INVERTED BREVE". c3aa is the single character ê

Some COLLATIONs can compensate for the differences, but it is up to the application to combine them at key-stroke time.

SELECT CAST(UNHEX('65cc91') AS CHAR) =
       CAST(UNHEX('c3aa') AS CHAR) COLLATE utf8_unicode_520_ci;  --> 1
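
To check these byte sequences from Java rather than SQL, something like the following sketch could be used. `ComposedVersusCombining`, `decodeUtf8Hex` and `codePoints` are illustrative names, not from any library. It also decodes 65cc81, the sequence that appears in the question's dump.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class ComposedVersusCombining {

    // Hypothetical helper: decode a hex string such as "65cc91" as UTF-8 text.
    static String decodeUtf8Hex(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: list the code points of s as "U+XXXX" separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(codePoints(decodeUtf8Hex("c3a9")));   // U+00E9
        System.out.println(codePoints(decodeUtf8Hex("65cc81"))); // U+0065 U+0301
        System.out.println(codePoints(decodeUtf8Hex("65cc91"))); // U+0065 U+0311
        System.out.println(codePoints(decodeUtf8Hex("c3aa")));   // U+00EA

        // e + COMBINING ACUTE ACCENT composes to é under NFC ...
        String acute = Normalizer.normalize(decodeUtf8Hex("65cc81"), Normalizer.Form.NFC);
        System.out.println(acute.equals(decodeUtf8Hex("c3a9")));   // true

        // ... but e + COMBINING INVERTED BREVE does not compose to ê.
        String breve = Normalizer.normalize(decodeUtf8Hex("65cc91"), Normalizer.Form.NFC);
        System.out.println(breve.equals(decodeUtf8Hex("c3aa")));   // false
    }
}
```

So the SQL comparison above presumably returns 1 because that collation ignores accent differences altogether, not because the two sequences are canonically equivalent; Unicode normalization alone would not make them equal.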
