
The goal is to continuously read from stdin and enforce utf8 in both Python2 and Python3.

I've tried:

#!/usr/bin/env python

from __future__ import print_function, unicode_literals
import io
import sys

# Supports Python2 read from stdin and Python3 read from stdin.buffer
# https://stackoverflow.com/a/23932488/610569
user_input = getattr(sys.stdin, 'buffer', sys.stdin)


# Enforcing utf-8 in Python3
# https://stackoverflow.com/a/16549381/610569
with io.TextIOWrapper(user_input, encoding='utf-8') as fin:
    for line in fin:
        # Reads the input line by line
        # and do something, for e.g. just print line.
        print(line)

The code works in Python3, but in Python2 the `file` object backing `sys.stdin` doesn't provide the interface that `io.TextIOWrapper` expects, and it throws:

Traceback (most recent call last):
  File "testfin.py", line 12, in <module>
    with io.TextIOWrapper(user_input, encoding='utf-8') as fin:
AttributeError: 'file' object has no attribute 'readable'

That's because in Python3 the `user_input`, i.e. `sys.stdin.buffer`, is an `_io.BufferedReader` object, and its attributes include `readable`:

<class '_io.BufferedReader'>

['__class__', '__del__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__ne__', '__new__', '__next__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_checkClosed', '_checkReadable', '_checkSeekable', '_checkWritable', '_dealloc_warn', '_finalizing', 'close', 'closed', 'detach', 'fileno', 'flush', 'isatty', 'mode', 'name', 'peek', 'raw', 'read', 'read1', 'readable', 'readinto', 'readinto1', 'readline', 'readlines', 'seek', 'seekable', 'tell', 'truncate', 'writable', 'write', 'writelines']

While in Python2 the `user_input` is a `file` object, and its attributes don't include `readable`:

<type 'file'>

['__class__', '__delattr__', '__doc__', '__enter__', '__exit__', '__format__', '__getattribute__', '__hash__', '__init__', '__iter__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'close', 'closed', 'encoding', 'errors', 'fileno', 'flush', 'isatty', 'mode', 'name', 'newlines', 'next', 'read', 'readinto', 'readline', 'readlines', 'seek', 'softspace', 'tell', 'truncate', 'write', 'writelines', 'xreadlines']
alvas

2 Answers


If you don't need a fully-fledged io.TextIOWrapper, but just a decoded stream for reading, you can use codecs.getreader() to create a decoding wrapper:

import codecs

reader = codecs.getreader('utf8')(user_input)
for line in reader:
    # do whatever you need...
    print(line)

codecs.getreader('utf8') creates a factory for a codecs.StreamReader, which is then instantiated with the original stream. I'm not sure whether the StreamReader supports the `with` statement, but that might not be strictly necessary (there's no need to close STDIN after reading, I guess...).

I've successfully used this solution in situations where the underlying stream only offers a very limited interface.
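For illustration, the same decoding wrapper can be exercised on an in-memory binary stream (hypothetical sample data standing in for stdin):

```python
import codecs
import io

# A BytesIO stands in for the binary stdin stream.
binary_stream = io.BytesIO('héllo\nwörld\n'.encode('utf-8'))

# getreader('utf8') returns the UTF-8 StreamReader class;
# instantiating it wraps the byte stream in a decoding reader.
reader = codecs.getreader('utf8')(binary_stream)

# Iteration yields already-decoded text lines.
for line in reader:
    print(line, end='')  # prints: héllo / wörld
```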

Update (2nd version)

From the comments, it became clear that you actually need an io.TextIOWrapper to have proper line buffering etc. in interactive mode; codecs.StreamReader only works for piped input and the like.

Using this answer, I was able to get interactive input to work properly:

#!/usr/bin/env python
# coding: utf8

from __future__ import print_function, unicode_literals
import io
import sys

user_input = getattr(sys.stdin, 'buffer', sys.stdin)

with io.open(user_input.fileno(), encoding='utf8') as f:
    for line in f:
        # do whatever you need...
        print(line)

This creates an io.TextIOWrapper with enforced encoding from the binary STDIN buffer.
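As a sketch of the mechanism (using a temporary file descriptor instead of stdin, so the example is self-contained): `io.open()` on a raw descriptor builds the full `io.TextIOWrapper` stack with the requested encoding, and passing `closefd=False` keeps the underlying descriptor open after the wrapper is closed, which may be preferable when wrapping stdin.

```python
import io
import os
import tempfile

# A temp file stands in for stdin's file descriptor.
fd, path = tempfile.mkstemp()
os.write(fd, 'héllo\n'.encode('utf-8'))
os.lseek(fd, 0, os.SEEK_SET)

# io.open() on a descriptor returns a TextIOWrapper with the given
# encoding; closefd=False leaves fd open after the wrapper is closed.
with io.open(fd, encoding='utf8', closefd=False) as f:
    print(type(f).__name__)   # TextIOWrapper
    print(f.read(), end='')   # héllo

os.close(fd)  # fd is still valid because of closefd=False
os.remove(path)
```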

lenz
  • Not really right if the buffer is needed to stream. If you try the code snippet in the OP using Python3, you would see a different behavior. The `sys.stdin` behaves differently from a normal `input()` or `raw_input()`. In my scenario, the stdin is sort of necessary to keep the stream, e.g. if there's a socket and the stream shouldn't be closed. – alvas Nov 23 '17 at 00:47
  • For context, this code is to be used in https://github.com/marian-nmt/marian-dev/blob/master/scripts/server/client_example.py where the socket is open for user input from stdin. Although it's possible to write a while loop to use `input()`, it's a little odd to do so when `stdin` inherently does that. The issue is when utf8 string is passed, there's a need to handle it, thus the `io.TextIOWrapper` =) – alvas Nov 23 '17 at 00:50
  • I'm not sure I understand your comments. I didn't think about the built-in `[raw_]input()`, I just reused your `user_input` variable, which is defined in the OP as `getattr(sys.stdin, 'buffer', sys.stdin)`. Unless there is a mistake, the proposed solution should work with streams (it doesn't close STDIN or something). – lenz Nov 23 '17 at 07:15
  • Try the script in the OP, it behaves a little differently from yours =) – alvas Nov 23 '17 at 07:23
  • @alvas what is this different behaviour you talk about? Can you be more specific? I tried to be more clear about the proposed solution. – lenz Nov 23 '17 at 13:32
  • Try using `stdin` interactively as your input instead of piping the input. When using `stdin` it behaves differently. For your script in Python3, `enter` then `ctrl+D` needs to be invoked before the script does something. But with the OP's script, `enter` flushes the buffer. – alvas Nov 24 '17 at 06:17
  • Try the script in the OP and just run and test, I hope the explanation makes sense. – alvas Nov 24 '17 at 06:18
  • @alvas I see now. Yes, with the `codecs.StreamReader`, you need repeated `ctrl+D` signals to trigger flushing. And it took me three of them to end the script... – lenz Nov 25 '17 at 10:14
  • I made a new update to the answer. This worked fine in interactive mode when I tested it. – lenz Nov 25 '17 at 10:47
  • Thanks for figuring out the right syntax to replicate the OP code behavior in Python2 and Python3! Now it's a puzzle why we need this boilerplate code to buffer streaming; I'll dig into the CPython code =) – alvas Nov 27 '17 at 03:18
  • Well, to some extent this boilerplate is the price you pay for Py2/3 compatibility, I guess. I've seen worse, though... – lenz Nov 27 '17 at 09:01
  • The lesson I learnt in this is that in most cases you shouldn't try to instantiate `io` classes directly, but instead use `io.open()`. – lenz Nov 27 '17 at 09:03
  • This is a good solution, but I wonder, should we pass closefd=False to io.open() to prevent the stdin file descriptor from being closed? (I'm not sure what the consequences of closing a stdio file descriptor would be on different platforms.) – ejm Jan 30 '19 at 14:45
  • @ejm I guess closing a stdio stream is not a problem, unless you want to further use it in the same process, of course. I generally try to avoid it too, though. You can just omit the `with` statement, so the file won't be closed. – lenz Jan 30 '19 at 17:34

Have you tried forcing utf-8 encoding in Python as follows:

import sys
reload(sys)  # Python2 only; `reload` is not a builtin in Python3
sys.setdefaultencoding('utf-8')
sancelot
  • The point was to avoid setting locale, so that the script supports both Python2 and Python3. Also, reloading the default encoding is discouraged =( – alvas Nov 27 '17 at 08:18
  • This affects implicit conversions between Unicode and ASCII, **globally**. This is a terrible idea, as libraries have been built to rely on non-ASCII data throwing an exception. That expectation is broken with this change. – Martijn Pieters Nov 27 '17 at 12:52