29

I've just added a Python 3 interpreter to Sublime, and the following code stopped working:

for directory in directoryList:
    fileList = os.listdir(directory)
    for filename in fileList:
        filename = os.path.join(directory, filename)
        currentFile = open(filename, 'rt')
        for line in currentFile:               ##Here comes the exception.
            currentLine = line.split(' ')
            for word in currentLine:
                if word.lower() not in bigBagOfWords:
                    bigBagOfWords.append(word.lower())
        currentFile.close()

I get the following exception:

  File "/Users/Kuba/Desktop/DictionaryCreator.py", line 11, in <module>
    for line in currentFile:
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 305: ordinal not in range(128)

I found this rather strange, because as far as I know Python 3 is supposed to support UTF-8 everywhere. What's more, the exact same code works without problems on Python 2.7. I've read about adding the environment variable PYTHONIOENCODING, and I tried it, to no avail. (However, it appears that adding an environment variable in OS X Mavericks is not that easy, so maybe I did something wrong when adding it? I modified /etc/launchd.conf.)

Martijn Pieters
3yakuya

3 Answers

68

Python 3 decodes text files when reading and encodes them when writing. The default encoding is taken from locale.getpreferredencoding(False), which evidently returns 'ASCII' for your setup. See the open() function documentation:

In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
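
To see which encoding a given setup would pick, something like the following minimal sketch can help. The variable names are illustrative only; the two calls are standard library functions.

```python
import locale
import sys

# Encoding that open() falls back to in text mode when encoding= is omitted.
default_text_encoding = locale.getpreferredencoding(False)
print("open() default encoding:", default_text_encoding)

# Separate setting: the default for str.encode() / bytes.decode().
# In Python 3 this is 'utf-8' regardless of the locale.
print("sys.getdefaultencoding():", sys.getdefaultencoding())
```

If the first line prints something like 'ANSI_X3.4-1968' or 'US-ASCII', text files opened without an explicit encoding are decoded as ASCII.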

Instead of relying on a system setting, you should open your text files using an explicit codec:

currentFile = open(filename, 'rt', encoding='latin1')

where you set the encoding parameter to match the file you are reading.
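
As a rough sketch, the loop from the question could be written with an explicit encoding as follows. The function name `collect_words`, the `encoding` parameter default, and the temporary demo file are assumptions made for illustration. This version also uses a set and `str.split()` without arguments, which differs slightly from the original list and `split(' ')`.

```python
import os
import tempfile


def collect_words(directory_list, encoding="utf-8"):
    """Return the unique lower-cased words found in the files of each directory."""
    words = set()
    for directory in directory_list:
        for name in os.listdir(directory):
            path = os.path.join(directory, name)
            # Explicit encoding: decoding no longer depends on the locale.
            with open(path, "rt", encoding=encoding) as current_file:
                for line in current_file:
                    words.update(word.lower() for word in line.split())
    return words


# Small demo with a UTF-8 file that contains non-ASCII characters.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "sample.txt"), "wb") as raw_file:
        raw_file.write("Caf\u00e9 na\u00efve caf\u00e9\n".encode("utf-8"))
    demo_words = collect_words([demo_dir])
    print(len(demo_words), "unique words")
```

The `with` statement closes each file even if decoding fails partway through.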

Python 3 supports UTF-8 as the default for source code.

The same applies to writing to a writeable text file: data written will be encoded, and if you rely on the system encoding you are liable to get UnicodeEncodeError exceptions unless you explicitly set a suitable codec. What codec to use when writing depends on what text you are writing and what you plan to do with the file afterward.
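
A minimal sketch of the writing side, using a temporary file and a sample string chosen only for illustration: an ASCII codec rejects the text, while an explicit UTF-8 codec round-trips it.

```python
import os
import tempfile

text = "caf\u00e9 na\u00efve\n"  # contains non-ASCII characters

with tempfile.TemporaryDirectory() as demo_dir:
    path = os.path.join(demo_dir, "out.txt")

    # Simulate a system whose default encoding is ASCII.
    ascii_failed = False
    try:
        with open(path, "w", encoding="ascii") as out_file:
            out_file.write(text)
    except UnicodeEncodeError:
        ascii_failed = True

    # An explicit, suitable codec avoids the problem.
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write(text)
    with open(path, "r", encoding="utf-8") as in_file:
        round_trip = in_file.read()

print("ASCII write failed:", ascii_failed)
print("UTF-8 round trip ok:", round_trip == text)
```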

You may want to read up on Python 3 and Unicode in the Unicode HOWTO, which explains both about source code encoding and reading and writing Unicode data.

Martijn Pieters
  • Well, it works indeed. When I change the encoding to 'utf-8' in open, it fails again. Is this fine? – 3yakuya May 28 '14 at 17:09
  • Clear now, thank you. So only source code is always expected to be UTF-8. – 3yakuya May 28 '14 at 17:11
  • @Byakuya: yes, source code is by default expected to use UTF-8; you can use a `# coding: ..` comment as the first or second line to indicate a different encoding for the source file. – Martijn Pieters May 28 '14 at 17:13
  • @Cerin I highly doubt that. Either you are not really running Python 3 (check `sys.version_info`) or you rebound the name `open` to something other than the built-in. – Martijn Pieters Apr 16 '16 at 00:18
  • @Cerin: whenever you encounter a situation where your code doesn't match the documentation (which I linked to in my answer), triple-check your assumptions. Note that the error message you got is *exactly* the error message Python 2 would throw if you tried to use `encoding` in an `open()` call. – Martijn Pieters Apr 16 '16 at 09:11
  • This seems a bit tedious. Is there a way to tell the interpreter to expect UTF-8 whenever reading or writing files? If it can try to read it with `locale.getpreferredencoding(False)`, surely there must be a way to also manually set it instead of letting it read from the system? – xji Apr 26 '17 at 11:16
  • @JIXiang: you could set your locale to UTF-8 before starting Python. However, I find it prudent to set the encoding explicitly, regardless. Explicit is better than implicit. – Martijn Pieters Apr 26 '17 at 11:17
  • @EamonnKenny: the difference then is not the Debian or Python release but the locale setup on each machine. Take a look at the output of the `locale` command on either. – Martijn Pieters Jun 21 '18 at 14:14
  • @MartijnPieters actually now the problem is worse. A file that was read correctly under Python 3.4 with a different locale isn't working now. I'm now getting 0xe2 errors. So it must be a Python issue. – Eamonn Kenny Jun 21 '18 at 14:21
  • @EamonnKenny: you are not giving me enough information to even begin helping you. You are probably getting *UnicodeDecodeError* or *UnicodeEncodeError* exceptions, not *0xe2 errors*. The error can come from a *huge* number of places, and it is rare that it is Python that is at fault. I can't help you here; at any rate, post a question with a [mcve] and we can see if we can help you. – Martijn Pieters Jun 21 '18 at 14:28
  • @MartijnPieters You were right! Instead of changing LANG I changed /etc/locale.gen to use en_IE.utf8 and then ran the code, and now the locale is interpreted correctly for reading utf-8 and latin-1 files. Thanks. – Eamonn Kenny Jun 21 '18 at 14:37
  • NB: The routine for finding the fallback encoding used by Python3 changes over time and varies by platforms. For example Python 3.4 for macOS you can see https://github.com/python/cpython/blob/v3.7.4/Lib/locale.py#L659 and ultimately https://github.com/python/cpython/blob/v3.7.4/Lib/_bootlocale.py#L45 fallbacks to UTF-8 if nothing else can be found. – Anon Sep 24 '19 at 15:38
  • @Anon: you say *Python 3.4* then link to the Python 3.7 source code; I think you mean that *from 3.4 onwards* there is a simplified fallback. See https://bugs.python.org/issue9548. The OSX fallback goes [much further back](https://bugs.python.org/issue6393) and was part of Python 3.0 (and was later backported to Python 2.7). The fallback is only used if `LANG` was set to an invalid value. – Martijn Pieters Sep 25 '19 at 12:15
  • @MartijnPieters you're absolutely correct! I meant to say from 3.4 onwards, I actually meant to link to the 3.4 version of the repo (e.g. https://github.com/python/cpython/blob/v3.4.5/Lib/locale.py#L635 and https://github.com/python/cpython/blob/v3.4.5/Lib/_bootlocale.py#L33 ) and yup the OSX fallback is older (and while I saw code referring to the old fix it took a while to work out where it lives in modern Python versions :-). – Anon Sep 28 '19 at 06:49
1

"as far as I know Python3 is supposed to support utf-8 everywhere ..." Not true. I have Python 3.6 and my default encoding is NOT UTF-8. To change it to UTF-8 in my code I use:

import locale

def getpreferredencoding(do_setlocale=True):
    return "utf-8"

locale.getpreferredencoding = getpreferredencoding

as explained in Changing the “locale preferred encoding” in Python 3 in Windows
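
For interpreters newer than the 3.6 mentioned here, there is an alternative that needs no monkeypatching: Python 3.7 and later have a UTF-8 mode, enabled with the `-X utf8` command-line option or the `PYTHONUTF8=1` environment variable. The sketch below assumes Python 3.7+ and starts a child interpreter only to show the effect; the variable names are illustrative.

```python
import subprocess
import sys

# Run a child interpreter in UTF-8 mode (assumes Python 3.7 or later)
# and ask it which encoding open() would use by default.
result = subprocess.run(
    [
        sys.executable,
        "-X", "utf8",
        "-c", "import locale; print(locale.getpreferredencoding(False))",
    ],
    capture_output=True,
    text=True,
    check=True,
)
child_encoding = result.stdout.strip()
print("Default encoding in UTF-8 mode:", child_encoding)
```

Passing `encoding=` to open() explicitly still works in this mode and remains the more portable choice.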

Farid Khafizov
1

In general, I have found three ways to fix Unicode-related errors in Python 3:

  1. Specify the encoding explicitly, e.g. currentFile = open(filename, 'rt', encoding='utf-8')

  2. Since bytes have no encoding, convert the string data to bytes before writing it to a file opened in binary mode, e.g. data = 'string'.encode('utf-8')

  3. Especially in a Linux environment, check $LANG. This issue usually arises when LANG=C, which makes the default encoding 'ascii' instead of 'utf-8'. You can change it to another appropriate value, such as LANG='en_IN'
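
The second option can be sketched like this; the file name and sample text are made up for the example. Because the file is opened in binary mode ('wb' / 'rb'), Python performs no implicit encoding or decoding and the locale plays no role.

```python
import os
import tempfile

# Encode the text yourself, then write the resulting bytes.
data = "string with \u00e9".encode("utf-8")

with tempfile.TemporaryDirectory() as demo_dir:
    path = os.path.join(demo_dir, "data.bin")
    with open(path, "wb") as out_file:   # binary mode: bytes in, bytes out
        out_file.write(data)
    with open(path, "rb") as in_file:
        raw = in_file.read()

# Decoding is likewise an explicit step chosen by the caller.
decoded = raw.decode("utf-8")
print(len(raw), "bytes read,", len(decoded), "characters decoded")
```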

vicky_kqr