While porting code from Python 2 to Python 3, I ran into a problem when reading UTF-8 text from standard input. In Python 2, this works fine:

import sys

for line in sys.stdin:
    ...

But Python 3 expects ASCII from sys.stdin, and if there are non-ASCII characters in the input, I get the error:

UnicodeDecodeError: 'ascii' codec can't decode byte .. in position ..: ordinal not in range(128)

For a regular file, I would specify the encoding when opening the file:

with open('filename', 'r', encoding='utf-8') as file:
    for line in file:
        ...

But how can I specify the encoding for standard input? Other SO posts (e.g. "How to change the stdin encoding on python") have suggested using:

import codecs
import sys

input_stream = codecs.getreader('utf-8')(sys.stdin)
for line in input_stream:
    ...

However, this doesn't work in Python 3. I still get the same error message. I'm using Ubuntu 12.04.2 and my locale is set to en_US.UTF-8.

Seppo Enarvi

1 Answer

Python 3 does not expect ASCII from sys.stdin. It'll open stdin in text mode and make an educated guess as to what encoding is used. That guess may come down to ASCII, but that is not a given. See the sys.stdin documentation on how the codec is selected.

Like other file objects opened in text mode, the sys.stdin object derives from the io.TextIOBase base class; it has a .buffer attribute pointing to the underlying buffered IO instance (which in turn has a .raw attribute).
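
To make that layering concrete, here is a small sketch that reports the three layers of a text-mode file object. A temporary file stands in for stdin (an assumption, so the example runs without piped input), and `layer_names` is a hypothetical helper, not a standard library function.

```python
import os
import tempfile


def layer_names(text_stream):
    """Return the class names of the text, buffered and raw layers."""
    buffered = text_stream.buffer  # binary, buffered layer
    raw = buffered.raw             # unbuffered layer on top of the file descriptor
    return (
        type(text_stream).__name__,
        type(buffered).__name__,
        type(raw).__name__,
    )


# Assumption: a temporary file stands in for stdin here.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, "r", encoding="utf-8") as handle:
        layers = layer_names(handle)
finally:
    os.remove(path)

print(layers)  # ('TextIOWrapper', 'BufferedReader', 'FileIO')
```

Calling `layer_names(sys.stdin)` in a plain terminal session should show the same kind of stack, though environments that replace `sys.stdin` may differ.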

Wrap the sys.stdin.buffer attribute in a new io.TextIOWrapper() instance to specify a different encoding:

import io
import sys

input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
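
A runnable sketch of the same idea, for trying it out without piping anything in: `io.BytesIO` stands in for `sys.stdin.buffer` (an assumption for the demo), and `utf8_text` is a hypothetical helper name.

```python
import io


def utf8_text(binary_stream):
    """Wrap a binary stream so that reads decode it as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8")


# Assumption: io.BytesIO plays the role of sys.stdin.buffer in this demo.
fake_stdin_buffer = io.BytesIO("Seppo\nnaïve café\n".encode("utf-8"))

lines = [line.rstrip("\n") for line in utf8_text(fake_stdin_buffer)]
print(lines)  # ['Seppo', 'naïve café']
```

In a real script you would pass `sys.stdin.buffer` instead of the `BytesIO` object and iterate over the wrapper as usual.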

Alternatively, set the PYTHONIOENCODING environment variable to the desired codec before starting Python; the interpreter reads it at startup.
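
One way to check the effect of the environment variable is to start a child interpreter with it set and ask what encoding its stdin ended up with. This is only a sketch: `child_stdin_encoding` is a hypothetical helper, and the child's stdin is connected to the null device so nothing has to be piped in.

```python
import os
import subprocess
import sys


def child_stdin_encoding(io_encoding):
    """Start a child interpreter with PYTHONIOENCODING set and
    return the encoding it reports for sys.stdin."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdin.encoding)"],
        env=env,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("ascii").strip()


reported = child_stdin_encoding("utf-8")
print(reported)
```

On the command line the equivalent is to set the variable for a single invocation of your script, which leaves the source code untouched.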

From Python 3.7 onwards, you can also reconfigure the existing std* wrappers, provided you do it at the start (before any data has been read):

# Python 3.7 and newer
sys.stdin.reconfigure(encoding='utf-8')
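
Since `reconfigure()` only exists on `io.TextIOWrapper` objects in Python 3.7 and newer, a defensive variant can fall back to wrapping `.buffer`. The sketch below is one possible arrangement, not an official recipe; `use_utf8` is a hypothetical helper, and an in-memory stream stands in for stdin.

```python
import io


def use_utf8(stream):
    """Return a text stream that decodes as UTF-8.

    Prefers reconfigure() (Python 3.7+), falls back to wrapping .buffer,
    and returns the stream unchanged if neither is available.
    """
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is not None:
        reconfigure(encoding="utf-8")  # must happen before any data is read
        return stream
    buffer = getattr(stream, "buffer", None)
    if buffer is not None:
        return io.TextIOWrapper(buffer, encoding="utf-8")
    return stream


# Assumption: an ASCII-configured in-memory stream stands in for sys.stdin.
ascii_stream = io.TextIOWrapper(
    io.BytesIO("naïve\n".encode("utf-8")), encoding="ascii"
)
decoded = use_utf8(ascii_stream).read()
print(decoded)
```

In a script this would be called as `input_stream = use_utf8(sys.stdin)` before the first read.
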
Martijn Pieters
  • What's the nearest equivalent for Python 2.6? – bukzor Dec 12 '13 at 22:08
  • @bukzor: Next option: open the file descriptor directly with `io.open()`; `0` is `stdin`: `io.open(0)` returns a `TextIOWrapper()` object. – Martijn Pieters Dec 16 '13 at 21:51
  • @MartijnPieters: That works pretty great! Thanks! Whole script: http://paste.pound-python.org/show/xoUPpsfFhtKssXBzLxBd/ Deleting my previous failures. – bukzor Dec 17 '13 at 01:53
  • You could call `sys.stdin.detach()` instead of `sys.stdin.buffer`. Though a preferable solution is to leave the source code alone and to configure the environment instead (locale, PYTHONIOENCODING). – jfs May 23 '16 at 18:50
  • If I use `io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')` to read from `stdin` in Python 2.7, it says `AttributeError: 'file' object has no attribute 'buffer'`. How can one make reading from `stdin` compatible with both Python 2 and 3? – Irshad Bhat Jun 27 '16 at 00:25
  • @IrshadBhat: Did you see [Wrap an open stream with io.TextIOWrapper](http://stackoverflow.com/q/34447623)? – Martijn Pieters Jun 27 '16 at 09:54
  • Is there a way to use `sys.stdin.buffer` in Python 2 too? – alvas Nov 22 '17 at 01:35
  • @alvas: to read binary? See [Reading binary data from stdin](//stackoverflow.com/q/2850893) – Martijn Pieters Nov 22 '17 at 07:51
  • I'm reading text but would like the code to support Python2 and Python3 without doing if sys version =( Specifically https://stackoverflow.com/questions/47425695/how-to-read-inputs-from-stdin-and-enforce-an-encoding – alvas Nov 22 '17 at 08:12
  • In order to **only** change the encoding, one should preserve the other stream parameters: `io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8', errors=sys.stdin.errors, newline=sys.stdin.newlines, line_buffering=sys.stdin.line_buffering)` However I haven't found a way to acquire the `write_through` parameter. – CMCDragonkai Jun 26 '18 at 02:07
  • @CMCDragonkai: Python 3.7 adds a [`write_through` attribute](https://docs.python.org/3.7/library/io.html#io.TextIOWrapper.write_through), and more importantly, lets you [reconfigure the wrapper](https://docs.python.org/3.7/library/io.html#io.TextIOWrapper.reconfigure). – Martijn Pieters Jun 26 '18 at 13:39
  • @Martijn Pieters: What happens under your suggestions if, say, one in 1000 bytes has somehow been corrupted and causes a local deviation from UTF-8? – David Epstein Jan 03 '19 at 10:42
  • @DavidEpstein: that's a rather hypothetical situation. The default error handler is `strict`, so an exception will be raised when you try to read data from stdin that is not proper UTF-8. Set the `errors` option to a different error handler to change that behaviour, but corrupted data is corrupted data. – Martijn Pieters Jan 03 '19 at 10:46
  • @Martijn Pieters: Articles about UTF-8 point out that one advantage of UTF-8 over many other encodings is that it is easy to recover from a rare error in a byte stream. So what you say sounds very useful in some circumstances (depending on the source of the input). – David Epstein Jan 03 '19 at 12:28
  • @DavidEpstein: yes, and [setting an error handler other than `strict`](https://docs.python.org/3/library/codecs.html#error-handlers) will let you skip corrupted bytes until a new valid start byte for a sequence is found. None of which has much to do with this specific answer; it applies universally to Python's encoding handling. – Martijn Pieters Jan 03 '19 at 12:43
  • `sys.stdin.reconfigure(encoding='utf-8')` gives **AttributeError: 'StdInputFile' object has no attribute 'reconfigure'** in Python 3.9.0 – Suncatcher Dec 24 '20 at 10:16
  • @Suncatcher: `StdInputFile` is *not a standard library type*. The `reconfigure()` method only exists on the [`io.TextIOWrapper()` class](https://docs.python.org/3.7/library/io.html#io.TextIOWrapper). It appears you are using Python in an IDE or other specialised environment. You'd have that issue in any Python version. – Martijn Pieters Dec 24 '20 at 16:01
  • I import `sys`, put this line into a script, and call the `.py` file in IDLE (based on 3.9.0) like this: `exec(open('C:\\script.py').read())`, so I believe it is a clean standard setup. I am too much of a newbie to use any additional modules or libraries on top of standard Python. – Suncatcher Dec 24 '20 at 16:22
  • @Suncatcher: IDLE is the IDE here, and has replaced the standard `sys.stdin` object with a custom object. That class is part of the IDLE internal implementation, not a standard library class. – Martijn Pieters Dec 24 '20 at 16:38
  • So when you `reconfigure`, does it apply only to the current file? – Or b Jan 03 '22 at 19:59
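
The error-handler exchange in the comments above can be tried out with a short sketch. It decodes the same corrupted byte string with three handlers; the sample bytes and the `decode_with` helper are made up for illustration, and `io.BytesIO` again stands in for `sys.stdin.buffer`.

```python
import io

# Valid UTF-8 ("café") followed by a stray 0xFF byte, which never occurs in UTF-8.
corrupted = b"caf\xc3\xa9 \xff ok\n"


def decode_with(handler):
    """Decode the sample bytes through a TextIOWrapper using the given error handler."""
    stream = io.TextIOWrapper(
        io.BytesIO(corrupted), encoding="utf-8", errors=handler
    )
    return stream.read()


replaced = decode_with("replace")  # bad byte becomes U+FFFD
ignored = decode_with("ignore")    # bad byte is dropped

try:
    decode_with("strict")          # the default handler raises
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(replaced), repr(ignored), strict_failed)
```

With `replace` the text around the bad byte survives and the damage stays visible, which is usually easier to debug than silently dropping bytes with `ignore`.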