Unicode encoding for filesystem in Mac OS X not correct in Python?

Question

Having a bit of a struggle with Unicode file names in OS X and Python. I am trying to use filenames as input for a regular expression later in the code, but the encoding used in the filenames seems to be different from what sys.getfilesystemencoding() tells me. Take the following code:

#!/usr/bin/env python
# coding=utf-8

import sys,os
print sys.getfilesystemencoding()

p = u'/temp/s/'
s = u'åäö'
print 's', [ord(c) for c in s], s
s2 = s.encode(sys.getfilesystemencoding())
print 's2', [ord(c) for c in s2], s2
os.mkdir(p+s)
for d in os.listdir(p):
  print 'dir', [ord(c) for c in d], d

It outputs the following:

utf-8
s [229, 228, 246] åäö
s2 [195, 165, 195, 164, 195, 182] åäö
dir [97, 778, 97, 776, 111, 776] åäö

So, the file system encoding is utf-8, but when I encode my filename åäö using that, the result is not the same as what I get back after creating a dir with the same string. I expect that when I use my string åäö to create a dir and read its name back, it should use the same codes as if I applied the encoding directly.

If we look at the code points 97, 778, 97, 776, 111, 776, they are basically ASCII characters each followed by a combining diacritic, e.g. o + ¨ = ö, which makes two code points per letter, not one. How can I avoid this discrepancy? Is there an encoding scheme in Python that matches this behaviour of OS X, and why is getfilesystemencoding() not giving me the right result?

Or have I messed up?

The problem can be solved for those specific characters by applying the following regexps to filename strings to get them into diacritic-less unicode: `m_aa = re.compile(ur"a\u0308",re.I)`, `m_ae = re.compile(ur"a\u030a",re.I)`, `m_oe = re.compile(ur"o\u0308",re.I)` – RipperDoc, Mar 18 '12 at 11:46

score 26 · Accepted Answer · edited Jun 30 '17 at 09:51

Mac OS X uses a special kind of decomposed UTF-8 to store filenames. If you need to, for example, read in filenames and write them to a "normal" UTF-8 file, you must normalize them:

import unicodedata
filename = unicodedata.normalize('NFC', unicode(filename, 'utf-8')).encode('utf-8')

from here: https://web.archive.org/web/20120423075412/http://boodebr.org/main/python/all-about-python-and-unicode
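
The one-liner above is Python 2. As a minimal sketch of the same idea in Python 3 (an assumption, since the question uses Python 2), where os.listdir() already returns str for a str path, the names only need normalizing, not decoding. The helper name nfc_names is hypothetical, and the decomposed literal stands in for what os.listdir() returned in the question:

```python
import unicodedata


def nfc_names(names):
    """Return the given names normalized to NFC (precomposed form).

    nfc_names is a hypothetical helper name, not part of any library.
    """
    return [unicodedata.normalize('NFC', name) for name in names]


# The decomposed name as the question's os.listdir() reported it:
# a + combining ring, a + combining diaeresis, o + combining diaeresis.
from_disk = ['a\u030aa\u0308o\u0308']
composed = '\u00e5\u00e4\u00f6'  # åäö as three precomposed code points

for name in nfc_names(from_disk):
    # After NFC the name compares equal to the composed literal.
    print(name == composed, [ord(c) for c in name])
```

In real use the list would come from the filesystem, e.g. nfc_names(os.listdir(p)), before the names are handed to a regular expression written with precomposed characters.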

edited Jun 30 '17 at 09:51 by Wilfred Hughes

answered Mar 18 '12 at 11:44 by sigman

Ran into this problem with node.js; the npm package `unorm` has a really nice interface for this. – mmilleruva Sep 08 '15 at 15:42

一二三 · Answer 2 · 2015-04-27T23:21:50.557

23

getfilesystemencoding() is giving you the correct response (the encoding), but it does not tell you the unicode normalisation form.

In particular, the HFS+ filesystem uses UTF-8 encoding, and a normalisation form close to "D" (which requires composed characters like ö to be decomposed into o¨). HFS+ is also tied to the normalisation form as it existed in Unicode version 3.2—as detailed in Apple's documentation for the HFS+ format.

Python's unicodedata.normalize method converts between forms, and if you prefix the call with the ucd_3_2_0 object, you can constrain it to Unicode version 3.2:

import unicodedata
filename = unicodedata.ucd_3_2_0.normalize('NFC', unicode(filename, 'utf-8')).encode('utf-8')
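
To make the decomposition concrete, here is a small sketch in Python 3 syntax (an assumption, since the question uses Python 2) that needs no filesystem. It decomposes åäö with the Unicode 3.2 tables and recomposes it again; plain NFD is used here only as an approximation of what HFS+ does:

```python
import unicodedata

composed = '\u00e5\u00e4\u00f6'  # åäö as three precomposed code points

# Decompose using the Unicode 3.2 tables that HFS+ is tied to.
decomposed = unicodedata.ucd_3_2_0.normalize('NFD', composed)
print([ord(c) for c in decomposed])  # [97, 778, 97, 776, 111, 776]

# Recompose, again constrained to Unicode 3.2.
recomposed = unicodedata.ucd_3_2_0.normalize('NFC', decomposed)
print(recomposed == composed)  # True
```

The decomposed code points are the same ones the question's os.listdir() call printed for the directory name.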

edited Apr 27 '15 at 23:21

answered Mar 18 '12 at 11:45 by 一二三

Thanks, great answer, wish I could upvote and accept both answers! – RipperDoc Mar 18 '12 at 14:49

Actually, it’s not quite NFD, but it’s close. – tchrist Mar 18 '12 at 15:45
If HFS+ stores filenames in decomposed form, wouldn't you use `normalise('NFD'...)` to match the HFS+ encoding? – Craig McQueen Jan 24 '19 at 00:08
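
On the question of direction: 'NFC' turns names read from disk into the precomposed form, while 'NFD' would move a typed string towards the decomposed form. Since the on-disk form is only close to NFD, a cautious approach is to normalize both sides to one common form before comparing, rather than trying to reproduce the stored bytes. A minimal sketch in Python 3 syntax, where same_name is a hypothetical helper name:

```python
import unicodedata


def same_name(a, b, form='NFC'):
    """Compare two names after normalizing both to the same form.

    same_name is a hypothetical helper, not part of any library.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


typed = '\u00f6'     # ö as one precomposed code point
on_disk = 'o\u0308'  # o followed by COMBINING DIAERESIS

print(typed == on_disk)                  # False
print(same_name(typed, on_disk))         # True
print(same_name(typed, on_disk, 'NFD'))  # True
```

Either form works for the comparison as long as both strings go through the same one.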
