16

I am on Python 2.6 for Windows.

I use os.walk to read a file tree. Files may have non-7-bit characters (German "ae" for example) in their filenames. These are encoded in Python's internal string representation.

I am processing these filenames with Python library functions and that fails due to wrong encoding.

How can I convert these filenames to proper (unicode?) python strings?

I have a file "d:\utest\ü.txt". Passing the path as unicode does not work:

>>> list(os.walk('d:\\utest'))
[('d:\\utest', [], ['\xfc.txt'])]
>>> list(os.walk(u'd:\\utest'))
[(u'd:\\utest', [], [u'\xfc.txt'])]
Bernd
  • 2
    It DOES work: Look at your output!! Both the directory name u'd:\\utest' and the file name u'\xfc.txt' are presented as unicode objects u'blahblah' instead of the previous str objects 'blahblah'. Perhaps the fact that the u-with-umlaut is represented as \xfc is boggling you but that's the same with str as with unicode and is nothing to do with the str/unicode problem. – John Machin Jun 27 '09 at 10:42
  • Perhaps you need to amplify "fails due to wrong encoding" ... What fails? How? Show the full traceback and error message. – John Machin Jun 27 '09 at 12:07
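
To illustrate the point made in the first comment: `\xfc` is only how the interpreter's repr spells the character U+00FC (u with umlaut). The unicode object itself holds that single character. A minimal sketch (the variable names are illustrative, not from the question):

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# The file name from the question, as a unicode object.
name = u'\xfc.txt'

# The escape \xfc and the code point U+00FC are the same character.
same = (name == u'\u00fc.txt')
first_code_point = ord(name[0])

print(same)                      # True
print(len(name))                 # 5: the umlaut counts as one character
print(first_code_point == 0xFC)  # True
```

So the `u'\xfc.txt'` in the second output is already a correctly decoded file name.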

6 Answers

47

If you pass a Unicode string to os.walk(), you'll get Unicode results:

>>> list(os.walk(r'C:\example'))          # Passing an ASCII string
[('C:\\example', [], ['file.txt'])]
>>> 
>>> list(os.walk(ur'C:\example'))        # Passing a Unicode string
[(u'C:\\example', [], [u'file.txt'])]
RichieHindle
  • except that if a filename is undecodable then [you can get a bytestring instead of Unicode in Python 2](http://stackoverflow.com/a/22314324/4279) – jfs May 17 '15 at 19:21
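
For anyone who wants to try this without touching real data, here is a small self-contained sketch that builds a throwaway directory, creates one file with a non-ASCII name, and walks it with a Unicode path. The helper `list_file_names` and the temporary directory are made up for the demo. It assumes the file system can store the name `ü.txt`.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

import os
import shutil
import sys
import tempfile


def list_file_names(top):
    """Return all file names found under `top` (pass a unicode path)."""
    names = []
    for dirpath, dirnames, filenames in os.walk(top):
        names.extend(filenames)
    return names


# Build a throwaway tree containing one file with a non-ASCII name.
demo_dir = tempfile.mkdtemp()
if isinstance(demo_dir, bytes):
    # Python 2 returns a byte string here; decode it so the walk is unicode.
    demo_dir = demo_dir.decode(sys.getfilesystemencoding())

try:
    open(os.path.join(demo_dir, u'\xfc.txt'), 'w').close()
    names = list_file_names(demo_dir)
finally:
    shutil.rmtree(demo_dir)

print(repr(names))
```

Because the path passed to os.walk is Unicode, the collected names should be Unicode objects as well, subject to the caveat about undecodable names on Python 2.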
6

I was looking for a solution for Python 3.0+. I will put it up here in case someone else needs it.

rootdir = r'D:\COUNTRY\ROADS\'
fs_enc = sys.getfilesystemencoding()
for (root, dirname, filename) in os.walk(rootdir.encode(fs_enc)):
    # do your stuff here, but remember that now
    # root is a bytes object, and dirname and filename are lists of bytes objects
  • @ramdaz 1. As the syntax highlighter demonstrates: you should not use odd slash before the closing quote. It is a `SyntaxError` in Python. 2. The code in the answer passed `bytes` to `os.walk()` and therefore the filenames are produced as bytes. You should use Unicode (drop `.encode()` call) instead. OP is confused. Passing Unicode is the right thing on Windows. – jfs May 17 '15 at 19:19
  • Seems to have stopped working with Python 3.5: `TypeError: os.scandir() doesn't support bytes path on Windows, use Unicode instead` – koppor Dec 16 '15 at 17:16
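
Following the comments above, a Python 3 sketch that avoids encoding the path at all. `walk_text` is a hypothetical helper name. It relies on `os.fsdecode` and `os.fsencode`, which convert between bytes and str using the file system encoding, and it assumes that encoding can represent the demo name.

```python
import os


def walk_text(top):
    """Walk `top`, accepting bytes or str, and always yield str names."""
    # os.fsdecode leaves str untouched and decodes bytes.
    return os.walk(os.fsdecode(top))


# Converting a single name in both directions.
raw_name = os.fsencode('\xfc.txt')    # bytes in the file system encoding
round_trip = os.fsdecode(raw_name)    # back to str

# Even with a bytes argument, the helper yields str paths.
first_dirpath = next(walk_text(b'.'))[0]
print(type(first_dirpath).__name__)
```

If the paths are already str, calling `os.walk(rootdir)` directly is enough.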
4

A more direct way might be to find your file system's encoding and then decode the byte-string file name from that encoding to Unicode. For example:

unicode_name = unicode(filename, "utf-8", errors="ignore")

To go the other way:

unicode_name.encode("utf-8")
gatoatigrado
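
A caveat on the snippet above, as a hedged sketch: hard-coding "utf-8" with errors="ignore" can silently drop characters when the bytes are really in another encoding. The byte string below is the one from the question's output, which looks like cp1252/latin-1. `to_text` is a hypothetical helper that uses `sys.getfilesystemencoding()` instead of a fixed codec. The results in the comments are what Python 3 produces. Python 2 may differ in how much it drops.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

import sys

# u-umlaut as a single byte, as in the question's byte-string listing.
raw = b'\xfc.txt'

dropped = raw.decode('utf-8', 'ignore')   # Python 3: '.txt', the umlaut is lost
marked = raw.decode('utf-8', 'replace')   # Python 3: '\ufffd.txt'
correct = raw.decode('cp1252')            # '\xfc.txt', the intended name


def to_text(name, errors='replace'):
    """Decode a byte-string file name with the file system encoding.

    Names that are already unicode are returned unchanged.
    """
    if isinstance(name, bytes):
        encoding = sys.getfilesystemencoding() or 'utf-8'
        return name.decode(encoding, errors)
    return name


print(repr(to_text(u'\xfc.txt')))
```

Using 'replace' rather than 'ignore' at least leaves a visible marker where decoding failed.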
4
os.walk(unicode(root_dir, 'utf-8'))
SomethingDark
Pegasus
2

os.walk isn't specified to always use os.listdir, nor is it documented how it handles Unicode. However, the os.listdir documentation does say:

Changed in version 2.3: On Windows NT/2k/XP and Unix, if path is a Unicode object, the result will be a list of Unicode objects. Undecodable filenames will still be returned as string objects.

Does simply using a Unicode argument work for you?

for dirpath, dirnames, filenames in os.walk(u"."):
  print dirpath
  for fn in filenames:
    print "   ", fn
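
Since the documentation quoted above says undecodable names can still come back as byte strings, one option is to separate them out before further processing. A minimal sketch, where `split_names` is a hypothetical helper and the sample list is invented for the demo:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def split_names(names):
    """Split a list of file names into (unicode names, byte-string names)."""
    decoded = [n for n in names if not isinstance(n, bytes)]
    undecoded = [n for n in names if isinstance(n, bytes)]
    return decoded, undecoded


# Invented sample: one decoded name and one that stayed a byte string.
sample = [u'\xfc.txt', b'\xff.txt']
decoded, undecoded = split_names(sample)
print(len(decoded), len(undecoded))
```

The byte-string names can then be logged or decoded with a fallback codec instead of failing later in the program.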
1

No, they are not encoded in Python's internal string representation; there is no such thing. They are encoded in the encoding of the operating system or file system. Passing in unicode works for os.walk, though.

I don't know how os.walk behaves when filenames can't be decoded, but I assume that you'll get a string back, like with os.listdir(). In that case you'll again have problems later. Also, not all of Python 2.x standard library will accept unicode parameters properly, so you may need to encode them as strings anyway. So, the problem may in fact be somewhere else, but you'll notice if that is the case. ;-)

If you need more control over the decoding, you can always pass in a byte string and then decode each name yourself with filename = filename.decode(encoding) as usual.

Lennart Regebro
  • 1
    Oh, excuse me for being more detailed than the other answers and bringing up potential problems with the solution. – Lennart Regebro Jun 27 '09 at 17:07
  • 1
    +1 for making a reasonable assumption but still explicitly stating that it's an assumption. Another +1 (if I could) for adding value to the discussion. – RichieHindle Jun 27 '09 at 22:15
  • The release announcement for Python 3.1 (released two days ago as I write this) says "File system APIs that use unicode strings now handle paths with undecodable bytes in them." I don't know whether that will fix this potential problem, or how, but anyone concerned about it should check out Python 3.1. – RichieHindle Jun 29 '09 at 08:37
  • Well, in 3.x you always get unicode back, so just switching to 3.x will likely solve this issue. But of course, there aren't many 3rd party modules for 3.x yet. Most notably, Setuptools is lacking. – Lennart Regebro Jun 29 '09 at 09:30
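
For reference, the Python 3.1 change mentioned in the comments is, as far as I can tell, the `surrogateescape` error handler from PEP 383. Undecodable bytes are mapped to lone surrogate code points, so the original bytes can be recovered later. A Python 3 sketch with a made-up byte string:

```python
# A name that is not valid UTF-8 (0xFC is not a legal UTF-8 byte).
raw = b'\xfc.txt'

# Decoding keeps the bad byte as the lone surrogate U+DCFC.
text = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode('utf-8', 'surrogateescape')

print(ascii(text))
print(restored == raw)
```

This mostly matters on POSIX systems, where file names are byte strings. On Windows, file names are natively Unicode.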