
I have a piece of code:

with open('filename.txt','r') as textfile:
    kwList = [x.strip('\n') for x in textfile.readlines()]

I get: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 5595: ordinal not in range(128), raised on line 2.

The problem is that, according to the Python docs (https://docs.python.org/3/library/functions.html#open), Python 3 uses locale.getpreferredencoding(False) to pick the default encoding when no encoding is specified in the open() call.

When I run locale.getpreferredencoding(False), I get 'UTF-8'.

Why does the UnicodeDecodeError mention the 'ascii' codec when Python should be using 'utf-8' here?

Chintan Shah
  • 2
    The locale depends on the *context* you are running your script in. Run the `locale.getpreferredencoding(False)` command in the same context. – Martijn Pieters May 11 '16 at 12:08
  • 1
    Is the UTF-8 preferred encoding being given in the same run of the same code (e.g. you added a `print(locale.getpreferredencoding(False))` directly above your `with open(...) as textfile`, or via some other means)? – Sean Vieira May 11 '16 at 12:08
  • 2
    And why not simply set the `encoding` argument to the `open()` call? – Martijn Pieters May 11 '16 at 12:08
  • @MartijnPieters, I can pass the encoding to the open() call and I have, this is just out of curiosity. On production servers I face this problem. – Chintan Shah May 11 '16 at 12:10
  • What do you exactly mean by context? Also I ran locale.getpreferredencoding(False) with the same user that the script runs with on the production code. Is there any other way to reproduce what you are talking about? – Chintan Shah May 11 '16 at 12:15
  • 1
    @ChintanShah: your production code may use the same user, but that doesn't mean that that code uses the same locale. If you are running this on a POSIX system (Mac, Linux, etc.) then the encoding is taken from the `LC_CTYPE` environment variable, which if not set explicitly is derived from `LC_ALL` or `LANG`. So if your production code is run with `LANG=C` or `LC_ALL=C`, then the default C locale is used, which uses ASCII as the encoding. – Martijn Pieters May 11 '16 at 12:20
  • Can you explain what you mean by 'context'? – Chintan Shah May 11 '16 at 12:24
  • @MartijnPieters, my LC_CTYPE of the user that executes the production code is "en_US.UTF-8" – Chintan Shah May 11 '16 at 12:31
  • @ChintanShah: but what is the preferred locale for *the code running in production*. How is that code run in the first place? You are still focusing on just opening Python under that user account, but that's not the whole picture. – Martijn Pieters May 11 '16 at 12:34
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/111640/discussion-between-chintan-shah-and-martijn-pieters). – Chintan Shah May 11 '16 at 12:51
  • To use UTF-8 explicitly, `import codecs` and use `codecs.open('filename.txt', mode='r', encoding='UTF-8')`. Then don't worry about changes in context and locale. – mpez0 May 11 '16 at 14:03
  • @mpez0: do not use `codecs.open()`. `open()` has `encoding` parameter already (on Python3). Use `io.open()` on Python 2. – jfs May 11 '16 at 15:00
  • @J.F.Sebastian You're right; I usually use codecs for compatibility back and forth between 2.x and 3.x. I should investigate io.open – mpez0 May 11 '16 at 15:12

1 Answer


The locale is taken from the context; on POSIX systems, that means the environment variables (see the POSIX locale documentation). You'll need to reproduce the exact context of your production environment if you want to test which encoding Python will pick (e.g. copy the environment variables used by the production environment too).
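One way to check the context is to log the relevant settings from inside the production process itself, rather than from an interactive shell. The snippet below is a minimal sketch of that idea; the helper name `describe_locale_context` is hypothetical and not part of any library, and the three environment variables shown are only the ones discussed here.

```python
import locale
import os
import sys


def describe_locale_context():
    """Collect the settings that influence open()'s default encoding.

    Hypothetical helper for illustration: call it from the same process
    (cron job, service, subprocess) that raises the UnicodeDecodeError.
    """
    info = {name: os.environ.get(name) for name in ("LC_ALL", "LC_CTYPE", "LANG")}
    # This is the value open() falls back to when no encoding is passed.
    info["preferred_encoding"] = locale.getpreferredencoding(False)
    # stdout may have been replaced by an object without an encoding attribute.
    info["stdout_encoding"] = getattr(sys.stdout, "encoding", None)
    return info


if __name__ == "__main__":
    for key, value in describe_locale_context().items():
        print(f"{key}={value!r}")
```

If the values printed by the production job differ from what you see in a login shell for the same user, the environment is the likely cause.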

You are probably running your program as a subprocess of something that only sets (or inherits) the effective user, but does not copy the environment for that user. Either an explicit locale has been set by that parent process or, if none is set, the default C locale is used. The default encoding for that locale is ASCII; some systems will report this by the name ANSI_X3.4-1968:

$ LANG=C python -c 'import locale; print(locale.getpreferredencoding(False))'
ANSI_X3.4-1968
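The same experiment can be scripted so that the environment is fully under your control. This is a sketch, not a definitive test harness: `preferred_encoding_with_env` is a hypothetical helper name, and it assumes Python 3.7 or newer for `capture_output`. Note that Python 3.7 and later may report UTF-8 even under the C locale, because of the locale coercion and UTF-8 mode introduced in that release, so the result depends on the interpreter version as well as the environment.

```python
import os
import subprocess
import sys


def preferred_encoding_with_env(overrides):
    """Run a child Python with modified locale variables.

    Returns whatever locale.getpreferredencoding(False) reports in the child.
    Hypothetical helper for illustration only.
    """
    # Drop the inherited locale variables so only the overrides apply.
    env = {
        key: value
        for key, value in os.environ.items()
        if key not in ("LC_ALL", "LC_CTYPE", "LANG")
    }
    env.update(overrides)
    result = subprocess.run(
        [sys.executable, "-c",
         "import locale; print(locale.getpreferredencoding(False))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Compare an explicit C locale with an environment that sets no locale at all.
    print("LC_ALL=C ->", preferred_encoding_with_env({"LC_ALL": "C"}))
    print("no locale variables ->", preferred_encoding_with_env({}))
```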

If, for example, your production code is run from cron, then the user's usual environment variables are not set just because the job runs as a specific user. Set the LC_ALL environment variable explicitly at the top of your crontab:

LC_ALL=en_US.UTF-8

if your cron implementation supports setting variables this way, or set it on the command line you are going to run:

* * * * *    LC_ALL=nb_NO.UTF-8 /path/to/your/program

See Where can I set environment variables that crontab will use?
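As the comments on the question point out, passing `encoding` to `open()` avoids the dependency on the locale altogether. A minimal sketch, assuming the file is actually UTF-8 encoded (the question does not confirm this) and using a hypothetical function name `read_keywords`:

```python
def read_keywords(path, encoding="utf-8"):
    """Read one keyword per line, decoding with an explicit encoding.

    Assumption: the file is UTF-8; pass a different encoding if it is not.
    """
    # With encoding given, locale.getpreferredencoding() is not consulted.
    with open(path, "r", encoding=encoding) as textfile:
        return [line.strip("\n") for line in textfile]
```

Called as `read_keywords('filename.txt')`, this should behave the same under cron, a service manager, or an interactive shell, whatever locale variables are set.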

Martijn Pieters
  • Any idea what might be the reason for getting `ANSI_X3.4-1968` from `LC_ALL=en_US.utf8 python -c 'import locale; print locale.getpreferredencoding(False)'` while `locale -a` returns (among other results) `en_US.utf8`? – Piotr Dobrogost Mar 26 '17 at 21:42
  • @PiotrDobrogost: This can depend on your OS too. I also find that different Python versions are being difficult about the spelling; on Python 3.6, using `UTF-8` works (so `LC_ALL=en_US.UTF-8`). I'm looking into this some more now, but it is not quite operating the way I expected it to on my Mac either. – Martijn Pieters Mar 27 '17 at 09:24