
I want to read a binary PNM image file from stdin. The file contains a header which is encoded as ASCII text, and a payload which is binary. As a simplified example of reading the header, I have created the following snippet:

#! /usr/bin/env python3
import sys
header = sys.stdin.readline()
print("header=["+header.strip()+"]")

I run it as "test.py" (from a Bash shell), and it works fine in this case:

$ printf "P5 1 1 255\n\x41" |./test.py 
header=[P5 1 1 255]

However, a small change in the binary payload breaks it:

$ printf "P5 1 1 255\n\x81" |./test.py 
Traceback (most recent call last):
  File "./test.py", line 3, in <module>
    header = sys.stdin.readline()
  File "/usr/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 11: invalid start byte

Is there an easy way to make this work in Python 3?

Brent Bradburn
  • did you try to change the input encoding ? http://stackoverflow.com/a/16549381/4954037 – hiro protagonist Jul 18 '15 at 11:03
  • @hiroprotagonist: Thanks for the hint. The approach indicated there did lead me to one possible solution -- although it is a bit of a hack to apply Unicode decoding to arbitrary binary data. – Brent Bradburn Jul 19 '15 at 01:42

2 Answers


To read binary data, use a binary stream. One way to get one is to call the text stream's detach() method (TextIOBase.detach()), which returns the underlying binary buffer:

#!/usr/bin/env python3
import sys

sys.stdin = sys.stdin.detach() # convert to binary stream
header = sys.stdin.readline().decode('ascii') # b'\n'-terminated
print(header, end='')
print(repr(sys.stdin.read()))
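
To illustrate what detach() does, here is a minimal sketch that uses an in-memory io.BytesIO as a stand-in for the real stdin (an assumption made only so the example runs without a pipe). It shows that detach() hands back the underlying binary stream and that the text wrapper cannot be used afterwards:

```python
import io

# Stand-in for stdin: a text wrapper over an in-memory binary buffer.
raw = io.BytesIO(b"P5 1 1 255\n\x81")
text = io.TextIOWrapper(raw, encoding="utf-8")

binary = text.detach()  # returns the underlying binary stream
header = binary.readline().decode("ascii")  # header line is plain ASCII
payload = binary.read()  # remaining bytes, left undecoded

print(repr(header))
print(payload)

# After detach() the text wrapper is in an unusable state.
try:
    text.read()
    wrapper_usable = True
except ValueError:
    wrapper_usable = False
print("text wrapper still usable:", wrapper_usable)
```

With the real stream, the same calls apply to sys.stdin instead of the hypothetical `text` object.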
jfs

From the docs, it is possible to read binary data (as type bytes) from stdin with sys.stdin.buffer.read():

To write or read binary data from/to the standard streams, use the underlying binary buffer object. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').

So this is one direction that you can take -- read the data in binary mode. readline() and various other functions still work. Once you have captured the ASCII string, it can be converted to text, using decode('ASCII'), for additional text-specific processing.
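
As a rough sketch of the binary-mode approach, the helper below reads the header line as bytes, decodes only that line as ASCII, and leaves the payload as bytes. The name `read_simple_pnm` is hypothetical, and the code assumes the simplified one-line header used in the question ("P5 1 1 255"); real PNM files may spread the header over several lines or include comments, which this does not handle. An io.BytesIO object stands in for sys.stdin.buffer so the example runs on its own:

```python
import io


def read_simple_pnm(stream):
    """Hypothetical helper: parse a one-line 'MAGIC W H MAXVAL' header, then the payload."""
    header = stream.readline().decode("ascii")  # header is ASCII text
    magic, width, height, maxval = header.split()
    payload = stream.read()  # everything after the header, as bytes
    return magic, int(width), int(height), int(maxval), payload


# In a real script the argument would be sys.stdin.buffer.
demo = io.BytesIO(b"P5 1 1 255\n\x81")
magic, width, height, maxval, payload = read_simple_pnm(demo)
print(magic, width, height, maxval, payload)
```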

Alternatively, you can wrap the binary stream in io.TextIOWrapper() with the latin-1 character set. Because latin-1 maps each of the 256 possible byte values to the Unicode code point with the same number, the implicit decode operation is essentially a pass-through. The data will be of type str (which represents text), but each character corresponds 1-to-1 to an input byte (although a str may use more than one storage byte per input byte).
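
The 1-to-1 mapping can be checked directly, without any stream. This small sketch decodes all 256 byte values as latin-1, confirms the round trip is lossless, and then shows that the input from the question is rejected by the UTF-8 decoder:

```python
# Every possible byte value, 0x00 through 0xFF.
data = bytes(range(256))

# latin-1 decodes each byte to the code point with the same number.
text = data.decode("latin-1")
code_points = [ord(c) for c in text]
round_trip = text.encode("latin-1")

print(code_points == list(data))  # one character per byte, same values
print(round_trip == data)         # encoding restores the original bytes

# The same input as in the question fails under UTF-8.
try:
    b"P5 1 1 255\n\x81".decode("utf-8")
    failure = None
except UnicodeDecodeError as exc:
    failure = exc
    print(exc)
```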

Here's code that works in either mode:

#! /usr/bin/python3

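import io
import sys

BINARY = True  # either way works

if BINARY:
    # Read raw bytes from the underlying binary buffer.
    istream = sys.stdin.buffer
else:
    # latin-1 maps every byte to exactly one character. newline='\n' turns off
    # universal-newline translation, which would otherwise turn a 0x0D payload
    # byte into '\n'.
    istream = io.TextIOWrapper(sys.stdin.buffer, encoding='latin-1', newline='\n')

header = istream.readline()
if BINARY:
    header = header.decode('ASCII')
print("header=[" + header.strip() + "]")

payload = istream.read()
print("len=" + str(len(payload)))
for i in payload:
    # Iterating bytes yields ints; iterating str yields 1-character strings.
    print(i if BINARY else ord(i))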

Test every possible 1-pixel payload with the following Bash command:

for i in $(seq 0 255) ; do printf "P5 1 1 255\n\x$(printf %02x $i)" |./test.py ; done
Brent Bradburn
  • The hack of using `latin-1` as a conduit for binary data works because it is 8-bit clean (https://en.wikipedia.org/wiki/8-bit_clean), whereas UTF-8 (https://en.wikipedia.org/wiki/UTF-8) is not. – Brent Bradburn Jul 19 '15 at 01:47