I am trying to write something that goes through every file in a directory, but it crashes every time it meets a file that has an umlaut in its name, like ä.txt.

The shortened code:

import codecs
import os

for filename in os.listdir(WATCH_DIRECTORY):
    with codecs.open(filename, 'rb', 'utf-8') as rawdata:
        data = rawdata.readline()
        # ...

And then I get this:

IOError: [Errno 2] No such file or directory: '\xc3\xa4.txt'

I've tried to encode/decode the filename variable with .encode('utf-8'), .decode('utf-8'), and both combined. This usually leads to an "ascii cannot decode blah blah" error.

I also tried unicode(filename) with and without encode/decode.

Soooo, kinda stuck here :)

T-101
  • It has likely been encoded in a different format. See this post: http://stackoverflow.com/questions/6539881/python-converting-from-iso-8859-1-latin1-to-utf-8 – sloppypasta Oct 13 '16 at 14:20
  • @morbidlycurious: the data is supplied by `os.listdir()`, so it is either already decoded (if `WATCH_DIRECTORY` is a unicode path), or it is encoded bytes, in which case you don't need to decode because the OS has already given you the data in the right encoding. All you need to do is recombine that filename with the full base path of the directory. – Martijn Pieters Oct 13 '16 at 15:00

1 Answer

You are opening bare filenames, which are resolved relative to the current working directory; you need to join them with the directory they came from.

This has nothing really to do with encodings; both Unicode strings and byte strings will work, especially when sourced from os.listdir().
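As a rough illustration of that point, here is a minimal sketch written for Python 3, where text is `str` and bytes is `bytes` (in Python 2 the corresponding types are `unicode` and `str`). The scratch directory and the names `watch_directory`, `first_from_text`, and `first_from_bytes` are made up for the example and are not from the question:

```python
import os
import tempfile

# Hypothetical scratch directory standing in for WATCH_DIRECTORY.
watch_directory = tempfile.mkdtemp()
with open(os.path.join(watch_directory, '\xe4.txt'), 'wb') as handle:
    handle.write(b'hello\n')

# A text path yields text names; a bytes path yields bytes names.
text_names = os.listdir(watch_directory)
bytes_names = os.listdir(os.fsencode(watch_directory))

# Either kind opens without any manual encode/decode step, as long as
# it is joined with a directory path of the same type.
with open(os.path.join(watch_directory, text_names[0]), 'rb') as rawdata:
    first_from_text = rawdata.readline()
with open(os.path.join(os.fsencode(watch_directory), bytes_names[0]), 'rb') as rawdata:
    first_from_bytes = rawdata.readline()
```

In both cases the name comes back from the OS in a form that can be handed straight back to `open()`.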

However, os.listdir() produces just the base filename, not a path, so add that back in:

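for filename in os.listdir(WATCH_DIRECTORY):
    # os.listdir() returns bare names, so join each one with its directory
    fullpath = os.path.join(WATCH_DIRECTORY, filename)
    with codecs.open(fullpath, 'rb', 'utf-8') as rawdata:
        data = rawdata.readline()
        # ...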

By the way, I recommend you use the io.open() function rather than codecs.open(). The io module is the new Python 3 I/O framework, and is a lot more robust than the older codecs module.
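A minimal sketch of the same loop using io.open(), assuming the files are UTF-8 text. The temporary directory and the sample file are invented here so the snippet can run on its own; only the loop mirrors the code above. io.open() is available from Python 2.6 on, and in Python 3 the built-in open() is the same function.

```python
import io
import os
import tempfile

# Hypothetical scratch directory standing in for WATCH_DIRECTORY.
WATCH_DIRECTORY = tempfile.mkdtemp()
sample_path = os.path.join(WATCH_DIRECTORY, u'\xe4.txt')
with io.open(sample_path, 'w', encoding='utf-8') as handle:
    handle.write(u'gr\xfc\xdfe\n')

first_lines = []
for filename in os.listdir(WATCH_DIRECTORY):
    fullpath = os.path.join(WATCH_DIRECTORY, filename)
    # Text mode with an explicit encoding: io.open() decodes while reading.
    with io.open(fullpath, 'r', encoding='utf-8') as rawdata:
        first_lines.append(rawdata.readline())
```

One difference to watch for: io.open() rejects an encoding argument in binary mode, so the 'rb' passed to codecs.open() becomes 'r' here.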

Martijn Pieters