I have a small shell script in .x:

$ cat .x
u="Böhmáí"
touch "$u"
ls > .list
echo "$u" >.text

cat .list .text
diff .list .text
od -bc .list
od -bc .text

When I run this script with sh -x .x (-x is only there to show the commands), I get:

$ sh -x .x
+ u=Böhmáí
+ touch Böhmáí
+ ls
+ echo Böhmáí
+ cat .list .text
Böhmáí
Böhmáí
+ diff .list .text
1c1
< Böhmáí
---
> Böhmáí
+ od -bc .list
0000000   102 157 314 210 150 155 141 314 201 151 314 201 012            
           B   o   ̈    **   h   m   a   ́    **   i   ́    **  \n            
0000015
+ od -bc .text
0000000   102 303 266 150 155 303 241 303 255 012                        
           B   ö  **   h   m   á  **   í  **  \n                        
0000012

The same string, Böhmáí, is encoded as different bytes in the filename than in the contents of a file. In the terminal (UTF-8 encoded), the string looks the same in both variants.

Where is the rabbit?

clt60
  • Similar question: http://stackoverflow.com/questions/12147410/different-utf-8-sigature-for-same-diacritics-umlauts-2-binary-ways-to-write – SimonSimCity Aug 27 '12 at 18:29

1 Answer

(This is mostly stolen from a previous answer of mine...)

Unicode allows some accented characters to be represented in several different ways: as a single "code point" representing the accented character, or as a series of code points representing the unaccented version of the character, followed by the accent(s). For example, "ä" could be represented either precomposed as U+00E4 (UTF-8 0xc3a4, Latin small letter a with diaeresis) or decomposed as U+0061 U+0308 (UTF-8 0x61cc88, Latin small letter a + combining diaeresis).
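The two forms can be checked without involving a filesystem at all. Here is a minimal sketch using Python's standard `unicodedata` module; Python is my choice for illustration, not something the question uses, and the variable names are made up:

```python
import unicodedata

precomposed = "\u00e4"   # U+00E4, Latin small letter a with diaeresis
decomposed = "a\u0308"   # U+0061 followed by U+0308, combining diaeresis

# Both render as "ä", but they are different code point sequences
# and therefore different UTF-8 byte sequences.
print(precomposed.encode("utf-8").hex())   # c3a4
print(decomposed.encode("utf-8").hex())    # 61cc88
print(precomposed == decomposed)           # False

# Normalization converts one form into the other.
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```

NFC and NFD are the standard names for the composed and decomposed normalization forms.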

OS X's HFS+ filesystem requires that all filenames be stored in the UTF-8 representation of their fully decomposed form. In an HFS+ filename, "ä" MUST be encoded as 0x61cc88, and "ö" MUST be encoded as 0x6fcc88.

So what's happening here is that your shell script contains "Böhmáí" in precomposed form, so it gets stored that way in the variable u, and stored that way in the .text file. But when you create a file with that name (with touch), the filesystem converts it to the decomposed form for the actual filename. And when you ls it, it shows the form the filesystem has: the decomposed form.
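If you need to compare a name from a script against a name read back from the filesystem, one option is to normalize both sides first. The sketch below is only an illustration under assumptions: `same_name` is a hypothetical helper, and standard NFD stands in for what the filesystem does. That stand-in reproduces the bytes in your od output for this particular string, but it may not match the filesystem's behavior for every character.

```python
import unicodedata

# What echo wrote to .text: the precomposed form from the script.
text_form = "B\u00f6hm\u00e1\u00ed"

# Stand-in for what ls reported: the same name, decomposed with standard NFD.
list_form = unicodedata.normalize("NFD", text_form)

def same_name(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The raw bytes differ, which is why diff reports a mismatch.
print(text_form.encode("utf-8"))   # b'B\xc3\xb6hm\xc3\xa1\xc3\xad'
print(list_form.encode("utf-8"))   # b'Bo\xcc\x88hma\xcc\x81i\xcc\x81'
print(text_form == list_form)      # False

# After normalization the two names compare equal.
print(same_name(text_form, list_form))   # True
```

Normalizing both sides to the same form, rather than assuming which form each side already has, keeps the comparison correct whichever way the filesystem stores the name.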

Gordon Davisson
  • HFS+ actually uses a variant of NFD (normalization form decomposed) where some ranges of characters are precomposed, so filenames are not fully decomposed. See [Text Encodings in VFS](http://developer.apple.com/library/mac/#qa/qa1173/_index.html). – Lri Jun 19 '13 at 19:07
  • Thanks for this, very helpful. Would this HFS+ filesystem requirement to use NFD also affect other filesystems, for example, a FAT32 SD card mounted on a Mac? – LarsH Jun 17 '16 at 04:21
  • @LarsH: I just tested creating a file on FAT32 from OS X (v10.11.3), and it did convert to the NFD form. I do not know if other platforms would do the same, nor how a file stored with some other form of name would show up on OS X. – Gordon Davisson Jun 18 '16 at 07:22
  • Thanks for testing that and reporting results. After I posted the above comment, I discovered that the names of directories I had on a FAT32 microSD card were in precomposed (NFC, I assume) form, contrary to what MacOS X requires. The files & directories on the microSD card were created by copying them over a USB cable to an Android device, either via a mounted volume in Finder, or via Android File Transfer -- I don't remember which. – LarsH Jun 20 '16 at 14:12
  • I cannot reproduce this on macOS Catalina 10.15.6: `touch` followed by `ls` gives NFC strings. Admittedly I used copy & paste, but copy & paste usually does not normalise, and `uni identify` is reliable. – Helmut Wollmersdorfer Mar 06 '22 at 20:28