Convert python filenames to unicode

Question

I am on python 2.6 for Windows.

I use os.walk to read a file tree. Files may have non-7-bit characters (German "ae" for example) in their filenames. These are encoded in Python's internal string representation.

I am processing these filenames with Python library functions and that fails due to wrong encoding.

How can I convert these filenames to proper (unicode?) python strings?

I have a file "d:\utest\ü.txt". Passing the path as unicode does not work:

>>> list(os.walk('d:\\utest'))
[('d:\\utest', [], ['\xfc.txt'])]
>>> list(os.walk(u'd:\\utest'))
[(u'd:\\utest', [], [u'\xfc.txt'])]

It DOES work: Look at your output!! Both the directory name u'd:\\utest' and the file name u'\xfc.txt' are presented as unicode objects u'blahblah' instead of the previous str objects 'blahblah'. Perhaps the fact that the u-with-umlaut is represented as \xfc is boggling you but that's the same with str as with unicode and is nothing to do with the str/unicode problem. — John Machin, Jun 27 '09 at 10:42
Perhaps you need to amplify "fails due to wrong encoding" ... What fails? How? Show the full traceback and error message. — John Machin, Jun 27 '09 at 12:07
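
To illustrate the point in the comment above: the `\xfc` in the output is only how the interpreter displays the character, not a sign of a wrong encoding. A small sketch, written for Python 3, where `ascii()` produces the escaped form that Python 2's `repr()` shows. The variable names are mine, not from the question.

```python
import unicodedata

name = u'\xfc.txt'  # the escape \xfc denotes the single character U+00FC

# The \x form and the \u form spell the same 5-character string.
same_string = (name == u'\u00fc.txt')
length = len(name)

# U+00FC is the u-with-umlaut from the question.
char_name = unicodedata.name(name[0])

# ascii() shows non-ASCII characters as escapes, like the Python 2 repr.
escaped = ascii(name)
print(escaped)
```

So the escaped display and the real character are the same data; only the presentation differs.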

RichieHindle · Accepted Answer · 2009-06-27T13:05:41.057

47

If you pass a Unicode string to os.walk(), you'll get Unicode results:

>>> list(os.walk(r'C:\example'))          # Passing an ASCII string
[('C:\\example', [], ['file.txt'])]
>>> 
>>> list(os.walk(ur'C:\example'))        # Passing a Unicode string
[(u'C:\\example', [], [u'file.txt'])]

edited Jun 27 '09 at 13:05

answered Jun 27 '09 at 06:34

RichieHindle


except that if a filename is undecodable then [you can get a bytestring instead of Unicode in Python 2](http://stackoverflow.com/a/22314324/4279) – jfs May 17 '15 at 19:21

Shourya Sarcar · Answer 2 · 2011-05-15T13:06:05.330

6

I was looking for a solution for Python 3.0+. I will put it up here in case someone else needs it.

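import os
import sys

# As originally posted, r'D:\COUNTRY\ROADS\' is a SyntaxError:
# a raw string cannot end in an odd number of backslashes.
rootdir = 'D:\\COUNTRY\\ROADS\\'
fs_enc = sys.getfilesystemencoding()
for (root, dirnames, filenames) in os.walk(rootdir.encode(fs_enc)):
    # do your stuff here, but remember that root, dirnames and
    # filenames now hold bytes objects, not str
    pass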

edited May 15 '11 at 13:06

answered May 15 '11 at 07:48

Shourya Sarcar


@ramdaz 1. As the syntax highlighter demonstrates: you should not put an odd number of backslashes before the closing quote. It is a `SyntaxError` in Python. 2. The code in the answer passed `bytes` to `os.walk()` and therefore the filenames are produced as bytes. You should use Unicode (drop the `.encode()` call) instead. OP is confused. Passing Unicode is the right thing on Windows. – jfs May 17 '15 at 19:19
Seems to have stopped working with Python 3.5: `TypeError: os.scandir() doesn't support bytes path on Windows, use Unicode instead` – koppor Dec 16 '15 at 17:16
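
Following the comments above, here is a minimal Python 3 sketch of the difference: a `str` path yields `str` names, while a `bytes` path yields `bytes` names. The temporary directory and `file.txt` are placeholders of mine. Per the comment above, the bytes variant raised a `TypeError` on Windows under Python 3.5.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as rootdir:   # placeholder directory
    open(os.path.join(rootdir, 'file.txt'), 'w').close()

    # str in, str out: the form the comment recommends.
    str_names = [fn for _, _, fns in os.walk(rootdir) for fn in fns]

    # bytes in, bytes out: what the answer's .encode() call leads to.
    bytes_root = os.fsencode(rootdir)
    bytes_names = [fn for _, _, fns in os.walk(bytes_root) for fn in fns]

print(str_names)
print(bytes_names)
```

Unless you specifically need the raw bytes, passing `str` avoids a separate decoding step later.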

gatoatigrado · Answer 3 · 2010-03-06T15:36:42.053

4

A more direct way might be to find your file system's encoding and then decode the filename from that encoding to unicode. For example,

unicode_name = unicode(filename, "utf-8", errors="ignore")

to go the other way,

unicode_name.encode("utf-8")
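
One caveat with the snippet above is worth sketching: `errors="ignore"` silently drops bytes that are not valid in the chosen encoding, so guessing the wrong encoding loses characters. The example uses Python 3 syntax (`bytes.decode` in place of Python 2's `unicode()`), and the Latin-1 byte string is an assumed stand-in for the question's filename.

```python
raw_name = b'\xfc.txt'   # assumed: 'ü.txt' as a Latin-1/cp1252 byte string

# Wrong guess: 0xFC is not valid UTF-8, and "ignore" just drops it.
lossy = raw_name.decode('utf-8', errors='ignore')

# Right guess: the umlaut survives and the round trip is exact.
unicode_name = raw_name.decode('latin-1')
round_trip = unicode_name.encode('latin-1')

# Encoding the same text as UTF-8 gives a different byte sequence.
as_utf8 = unicode_name.encode('utf-8')

print(ascii(lossy), ascii(unicode_name), round_trip, as_utf8)
```

If losing characters is not acceptable, leaving `errors` at its default makes a wrong guess fail loudly instead.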

edited Mar 06 '10 at 15:36

answered Mar 06 '10 at 15:27

gatoatigrado


score 4 · Answer 4 · edited May 15 '15 at 08:45

4

os.walk(unicode(root_dir, 'utf-8'))

edited May 15 '15 at 08:45

SomethingDark


answered May 15 '15 at 07:14

Pegasus


score 2 · Answer 5 · answered Jun 27 '09 at 06:36

os.walk isn't specified to always use os.listdir, nor does its documentation say how Unicode is handled. However, os.listdir does say:

Changed in version 2.3: On Windows NT/2k/XP and Unix, if path is a Unicode object, the result will be a list of Unicode objects. Undecodable filenames will still be returned as string objects.

Does simply using a Unicode argument work for you?

for dirpath, dirnames, filenames in os.walk(u"."):
  print dirpath
  for fn in filenames:
    print "   ", fn
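
Since the documentation quoted above says undecodable names may still come back as byte strings, code that needs uniform text can normalise each name. A hedged sketch: `to_text` is a hypothetical helper of mine, not part of `os`, and replacing bad bytes with U+FFFD is only one possible policy.

```python
import sys

def to_text(name, encoding=None):
    """Return name as text, decoding it only if it is a byte string."""
    if isinstance(name, bytes):          # bytes is an alias of str on Python 2.6+
        encoding = encoding or sys.getfilesystemencoding()
        # 'replace' marks undecodable bytes with U+FFFD instead of raising.
        return name.decode(encoding, 'replace')
    return name                          # already text, leave untouched

# A mixed list, such as os.listdir may return on Python 2 per the note above.
mixed = [u'\xfc.txt', b'plain.txt', b'\xff\xfe.bin']
names = [to_text(n, 'utf-8') for n in mixed]
```

The replacement is lossy, so keep the original byte string around if you still need to open the file afterwards.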

score 1 · Answer 6 · answered Jun 27 '09 at 07:18

1

No, they are not encoded in Python's internal string representation; there is no such thing. They are encoded in the encoding of the operating system/file system. Passing in unicode works for os.walk though.

I don't know how os.walk behaves when filenames can't be decoded, but I assume that you'll get a string back, like with os.listdir(). In that case you'll again have problems later. Also, not all of Python 2.x standard library will accept unicode parameters properly, so you may need to encode them as strings anyway. So, the problem may in fact be somewhere else, but you'll notice if that is the case. ;-)

If you need more control of the decoding you can always pass in a string, and then just decode it with filename = filename.decode() as usual.

answered Jun 27 '09 at 07:18

Lennart Regebro


Oh, excuse me for being more detailed than the other answers and bringing up potential problems with the solution. – Lennart Regebro Jun 27 '09 at 17:07
+1 for making a reasonable assumption but still explicitly stating that it's an assumption. Another +1 (if I could) for adding value to the discussion. – RichieHindle Jun 27 '09 at 22:15
The release announcement for Python 3.1 (released two days ago as I write this) says "File system APIs that use unicode strings now handle paths with undecodable bytes in them." I don't know whether that will fix this potential problem, or how, but anyone concerned about it should check out Python 3.1. – RichieHindle Jun 29 '09 at 08:37
Well, in 3.x you always get unicode back, so just switching to 3.x will likely solve this issue. But of course, there are not many 3rd party modules for 3.x yet. Most notably Setuptools is lacking. – Lennart Regebro Jun 29 '09 at 09:30
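
As far as I understand it, the Python 3.1 change mentioned in the comments is the `surrogateescape` error handler (PEP 383): bytes that cannot be decoded are carried inside the `str` as lone surrogates and restored on encoding. A minimal sketch, using UTF-8 explicitly so it does not depend on the platform's file system encoding:

```python
raw_name = b'\xfc.txt'   # not valid UTF-8

# Decoding does not fail; the bad byte 0xFC becomes the lone surrogate U+DCFC.
name = raw_name.decode('utf-8', 'surrogateescape')

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode('utf-8', 'surrogateescape')

# A strict encode refuses the surrogate, so such names need care when
# they are printed or written to a file.
try:
    name.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

This is why a 3.x file system call can hand back `str` for every name and still reopen the file later.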
