I'm having a bit of trouble with a file containing the "ș" character (that's \xC8\x99 in UTF-8 - LATIN SMALL LETTER S WITH COMMA BELOW).

I'm creating a ș.txt file and trying to get it back with os.listdir(). Unfortunately, os.listdir() returns it as s\xCC\xA6 ("s" + COMBINING COMMA BELOW), so my test program (below) fails.

This happens on OS X, but the same program works on a Linux machine. Any idea what exactly causes this behavior? Both environments are configured with LANG=en_US.UTF8.

Here's the test program:

#coding: utf-8
import os

fname = "ș.txt"
with open(fname, "w") as f:
    f.write("hi")

files = os.listdir(".")
print "fname: ", fname
print "files: ", files

if fname in files:
    print "found"
else:
    print "not found"

1 Answer

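The OS X filesystem stores filenames mostly in decomposed form (close to Unicode NFD) rather than in composed form, so the name os.listdir() returns is not byte-for-byte the one you created. Linux filesystems typically store the filename bytes unchanged, which is why the same test passes there. Normalise the filenames back to the composed NFC form before comparing: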

import unicodedata
files = [unicodedata.normalize('NFC', f) for f in os.listdir(u'.')]

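Passing a unicode path (u'.') makes os.listdir() return unicode filenames, which is what unicodedata.normalize() expects; with a bytestring path you would have to decode each name to unicode first. Make fname a unicode literal (u"ș.txt") as well, so that the membership test compares unicode against unicode.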
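To make the difference between the two forms concrete, here is a minimal sketch that needs no filesystem at all. The variable names are only illustrative, and it is written so that it should run on Python 2.7 as well as Python 3:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import unicodedata

composed = u"\u0219.txt"      # "ș" as a single code point (NFC form)
decomposed = u"s\u0326.txt"   # "s" + COMBINING COMMA BELOW (NFD form)

# The two strings render identically but differ code point for code point.
print(composed == decomposed)

# Their UTF-8 encodings are the two byte sequences quoted in the question.
print(repr(composed.encode("utf-8")))
print(repr(decomposed.encode("utf-8")))

# Normalising the decomposed name to NFC yields the composed name.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

The first comparison is the one that fails in the question; the last one is what the list comprehension above relies on.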

Also see the unicodedata.normalize() function documentation.
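If the same check is needed in several places, it can be wrapped in small helpers. This is only a sketch: nfc, listdir_nfc and name_in_listing are hypothetical names, not part of the standard library, and the code assumes all names are already unicode (on Python 3, os.listdir() returns str for a str path; on Python 2, pass a unicode path as above).

```python
import os
import unicodedata


def nfc(name):
    """Return name in composed (NFC) form. Hypothetical helper."""
    return unicodedata.normalize("NFC", name)


def listdir_nfc(path=u"."):
    """List a directory with every entry normalised to NFC."""
    return [nfc(entry) for entry in os.listdir(path)]


def name_in_listing(name, listing):
    """Check membership after normalising both sides to NFC."""
    wanted = nfc(name)
    return any(nfc(entry) == wanted for entry in listing)
```

Normalising both the name you are looking for and the listing means the lookup works regardless of which form either side happens to be in, for example name_in_listing(u"ș.txt", os.listdir(u".")).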

Martijn Pieters