UTF-8 and os.listdir()

Question

I'm having a bit of trouble with a file containing the "ș" character (that's \xC8\x99 in UTF-8 - LATIN SMALL LETTER S WITH COMMA BELOW).

I'm creating a ș.txt file and trying to get it back with os.listdir(). Unfortunately, os.listdir() returns it as s\xCC\xA6 ("s" + COMBINING COMMA BELOW) and my test program (below) fails.

This happens on my OS X machine, but it works on a Linux machine. Any idea what exactly causes this behavior? Both environments are configured with LANG=en_US.UTF8.

Here's the test program:

#coding: utf-8
import os

fname = "ș.txt"
with open(fname, "w") as f:
    f.write("hi")

files = os.listdir(".")
print "fname: ", fname
print "files: ", files

if fname in files:
    print "found"
else:
    print "not found"

Martijn Pieters · Accepted Answer · 2014-11-04T11:35:01.733

9

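The OS X filesystem mostly stores filenames in decomposed form (close to Unicode NFD) rather than in composed form, which is why your "ș" comes back as "s" + COMBINING COMMA BELOW. You'll need to normalise the filenames back to the composed NFC form before comparing: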

import os
import unicodedata

# Passing a unicode path makes os.listdir() return unicode names
files = [unicodedata.normalize('NFC', f) for f in os.listdir(u'.')]

This processes filenames as unicode; you'd otherwise need to decode the bytestring to unicode first.
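If the names do arrive as byte strings, the decode step could look roughly like the sketch below. The helper name `to_nfc` and the assumption that the names are UTF-8 encoded are illustrative and not part of the original answer. It is written so that it should also run on Python 3.

```python
import unicodedata

def to_nfc(name, encoding="utf-8"):
    """Return name as NFC-normalised text (hypothetical helper).

    Byte strings are decoded first; the UTF-8 default is an assumption.
    """
    if isinstance(name, bytes):
        name = name.decode(encoding)
    return unicodedata.normalize("NFC", name)

# The two byte sequences from the question end up as the same text:
decomposed = b"s\xcc\xa6.txt"   # what os.listdir() returned on OS X
composed = b"\xc8\x99.txt"      # what the source file contained
same = to_nfc(decomposed) == to_nfc(composed)   # True
```

Something like `[to_nfc(f) for f in os.listdir('.')]` would then give a list that can be compared against a unicode filename.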

Also see the unicodedata.normalize() function documentation.
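To see why a plain `in` test fails, here is a small sketch that compares the two forms of the name without touching the filesystem. The function name `same_name` is made up for illustration; normalising both sides is one way to make the comparison independent of how either string was produced.

```python
import unicodedata

composed = u"\u0219.txt"                              # "ș" as a single code point
decomposed = unicodedata.normalize("NFD", composed)   # "s" + U+0326 COMBINING COMMA BELOW

def same_name(a, b):
    """Compare two unicode filenames after NFC normalisation (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

naive = composed == decomposed            # False: different code point sequences
robust = same_name(composed, decomposed)  # True: both normalise to the same NFC string
```

Both strings render identically on screen, which is what makes the mismatch hard to spot.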

edited Nov 04 '14 at 11:35

answered Nov 04 '14 at 10:40

Thanks for the link, I understand what's going on now. Your code doesn't work as-is btw, I need to do `u"ș.txt" in [unicodedata.normalize('NFC', f) for f in os.listdir(u'.')]` instead. – Unknown Nov 04 '14 at 11:03
@Unknown: right, or decode and again encode. But using a unicode path is better. – Martijn Pieters Nov 04 '14 at 11:04
@Unknown how can you do that? I'm facing that problem too – Nam Pham May 19 '16 at 11:18
@NamPham: do what exactly, what problem are you facing? The `files` list will contain a list of Unicode string objects, each normalised. – Martijn Pieters May 19 '16 at 11:20
I'm struggling with the decoding and encoding process, I can't pass `u'.'` as the argument to listdir. My path is unicode :( – Nam Pham May 19 '16 at 15:34
@NamPham: all you need is a `unicode` object. If your variable is a `str` object, you'll first need to decode. – Martijn Pieters May 19 '16 at 17:16
@MartijnPieters Thank you. – Nam Pham May 20 '16 at 04:12
Would the same go for Windows 10? – Superdooperhero Jun 08 '21 at 05:31
@Superdooperhero no. Windows does not perform any normalisation on pathnames. See https://stackoverflow.com/questions/7041013/unicode-normalization-in-windows. – Martijn Pieters Jun 08 '21 at 07:20
