
I have a Java program that is almost working perfectly. I'm developing on a Mac and pushing to Linux for production. When the Mac searches the file system and inserts new file names into the database, it works great. However, when I push to the Linux box and run the same search/insert, it treats some file names as different because of certain characters, e.g. Béla Fleck. The names look identical to me in the database and on both the Mac and Linux file systems. In fact, the Mac and Linux boxes both have NFS mounts to a third system (Linux) where the files reside.

I've dumped the bytes and can see how Linux and the Mac each see the string from the file system: Béla Fleck.

linux:

utf8bytes[0] = 0x42
utf8bytes[1] = 0x65
utf8bytes[2] = 0xcc
utf8bytes[3] = 0x81
utf8bytes[4] = 0x6c
utf8bytes[5] = 0x61
utf8bytes[6] = 0x20
utf8bytes[7] = 0x46
utf8bytes[8] = 0x6c
utf8bytes[9] = 0x65
utf8bytes[10] = 0x63
utf8bytes[11] = 0x6b

linux says LANG=en_US.UTF-8

mac:

utf8Bytes[0] = 0x42
utf8Bytes[1] = 0xc3
utf8Bytes[2] = 0xa9
utf8Bytes[3] = 0x6c
utf8Bytes[4] = 0x61
utf8Bytes[5] = 0x20
utf8Bytes[6] = 0x46
utf8Bytes[7] = 0x6c
utf8Bytes[8] = 0x65
utf8Bytes[9] = 0x63
utf8Bytes[10] = 0x6b

mac says LANG=en_US.UTF-8
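For anyone who wants to reproduce dumps like the two above, here is a minimal sketch. The class name `Utf8Dump` and the helper `toHex` are made up for illustration and are not from my actual program; the two strings are written with Unicode escapes so the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

class Utf8Dump {
    /** Returns the UTF-8 bytes of s as space-separated lowercase hex pairs. */
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT (the 12-byte dump).
        String decomposed = "Be\u0301la Fleck";
        // Single code point U+00E9 (the 11-byte dump).
        String precomposed = "B\u00e9la Fleck";

        System.out.println(toHex(decomposed));
        System.out.println(toHex(precomposed));
    }
}
```

The first line printed contains `65 cc 81` where the second contains `c3 a9`; everything else is identical.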

I tried this, still no joy:

java -Dfile.encoding=UTF-8

I'm using java.nio.file to get the directory:

java.nio.file.Path path = Paths.get("test");

then walking the path with

Files.walkFileTree(path, new SimpleFileVisitor<Path>() {

and then, since this is a subdir in the test path:

 file.getParent().getName(1).toString()

Anyone have any ideas on what is glitching here and how I can fix this?

Thanks.

phomlish
  • They look identical because the Mac file name contains a single accented ‘e’ character (`é`), whereas the Linux file name contains a plain ‘e’ character followed by a combining accent (`e ˊ`). Visually, they look identical, and most [Collators](http://docs.oracle.com/javase/8/docs/api/java/text/Collator.html) would consider them identical. As for why they’re different, it’s hard to tell without seeing the code that obtains/creates the file names. – VGR Dec 02 '16 at 15:13
  • added the java.nio.file calls – phomlish Dec 02 '16 at 15:43

2 Answers


Some searching revealed that OS X always decomposes file names.

This suggests to me that you may have accidentally switched the outputs: the first byte array is decomposed, so I’m guessing it was taken from a Mac, whereas the second one is from Linux.

In any event, if you want them to be identical for all systems, you can do the decomposition yourself:

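// needs: import java.text.Normalizer;
String name = file.getParent().getName(1).toString();
// NFD = canonical decomposition, e.g. é becomes e + combining acute accent
name = Normalizer.normalize(name, Normalizer.Form.NFD);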
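To see the effect in isolation, here is a small sketch that rebuilds the two names from the byte dumps in the question and compares them before and after normalization. The class name `NormalizeNames` and the helper `canonical` are hypothetical names chosen for this example; only `java.text.Normalizer` is the real API.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class NormalizeNames {
    /** Puts a name into canonical decomposed form (NFD). */
    static String canonical(String name) {
        return Normalizer.normalize(name, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // The 12-byte dump from the question: 'e' + combining acute accent.
        byte[] twelve = {0x42, 0x65, (byte) 0xcc, (byte) 0x81, 0x6c, 0x61,
                         0x20, 0x46, 0x6c, 0x65, 0x63, 0x6b};
        // The 11-byte dump from the question: precomposed 'é'.
        byte[] eleven = {0x42, (byte) 0xc3, (byte) 0xa9, 0x6c, 0x61,
                         0x20, 0x46, 0x6c, 0x65, 0x63, 0x6b};

        String a = new String(twelve, StandardCharsets.UTF_8);
        String b = new String(eleven, StandardCharsets.UTF_8);

        // Raw comparison: the strings hold different code points.
        System.out.println(a.equals(b));
        // After normalizing both sides to NFD they are equal.
        System.out.println(canonical(a).equals(canonical(b)));
    }
}
```

This prints `false` and then `true`. Whichever form you pick, apply it to every name before it reaches the database, including names that are already stored, so old and new rows are comparable.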
VGR
  • Time for me to learn what that Normalizer of which you speak is doing. It worked as you stated. – phomlish Dec 02 '16 at 16:58
  • The [documentation for Normalizer](http://docs.oracle.com/javase/8/docs/api/java/text/Normalizer.html) contains a link to the Unicode specification which lays out the concept of normalization. – VGR Dec 02 '16 at 18:09

(Not really an answer, just more discussion.)

Those seem to be utf8 characters, but formed in different ways.

c3a9 is é -- This is normally how one would enter an accented letter.

However, it is possible to use a pair of characters:

65cc81 is also é, but formed as a combination of e and a "COMBINING ACUTE ACCENT".

Some COLLATIONs can compensate for the differences; otherwise it is up to the application to combine them, for example at key-stroke time.

SELECT CAST(UNHEX('65cc81') AS CHAR) =
       CAST(UNHEX('c3a9') AS CHAR) COLLATE utf8_unicode_520_ci;  --> 1
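On the application side, combining can be done with Java's `java.text.Normalizer` using form NFC. The following is only a sketch: the class name `ComposeBeforeInsert` and the helper `toNfc` are hypothetical, and it assumes the composing happens just before the name is written to the database rather than at key-stroke time.

```java
import java.text.Normalizer;

class ComposeBeforeInsert {
    /** Returns s in composed form (NFC); already-composed input is returned as-is. */
    static String toNfc(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC)
                ? s
                : Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Name as read from the file system: 'e' + U+0301 (11 chars).
        String fromFileSystem = "Be\u0301la Fleck";

        // Hypothetical step before the INSERT: compose to a single U+00E9.
        String stored = toNfc(fromFileSystem);

        System.out.println(fromFileSystem.length() + " -> " + stored.length());
        System.out.println(stored.equals("B\u00e9la Fleck"));
    }
}
```

This prints `11 -> 10` and `true`. With every stored name in one form, the comparison no longer depends on which collation the column happens to use.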
Rick James