I have a Python project where I have to display data piped to me from another machine with an unknown encoding. I'm running Python 3 on an Ubuntu VM. What I get is a stream of bytes (perhaps command output, a cat'd file, or similar). I just need to display the data, regardless of source, as best I can.
As a test, I'm trying to cat /dev/urandom and have it be displayed the same way via python as it would be if I typed the command myself. To make it reproducible, I've used head -n 2 /dev/urandom instead of the endless stream that you get with cat.
In bash, when I cat the file, I get the standard random gunk. I have LANG=en_US.UTF-8. Lots of characters don't really render (a diamond with a question mark in it) or are just blank, because it's obviously not UTF-8, it's just random bytes of raw data:
eJ̘��}��jf��)���N�n��t��8=����X-�L�^t�M����Z���g�8#K T��c��z�ZO+�ϩD1{|EX
��)'���ei{W�r��畴��Ii�Y���
�}���+��;-�i-
S��Az
uV�1XBxFZ3+4��G�*��Q�+!
However, if I read the file in Python and print it to standard out, I get encoding errors unless I use 'latin-1'. I've even tried using the encoding from the default stream, assuming it was inherited from the terminal. This clearly isn't right, as I would need the encoding of the stream on the remote end of the pipe (which I don't have).
>>> f = open("foo.txt", "rb")
>>> data = f.read(530)
>>> import sys
>>> sys.stdout.write(data.decode(sys.stdout.encoding))
As expected, the error produced reflects random data not following the UTF-8 encoding standard:
>>> sys.stdout.write(data.decode(sys.stdout.encoding))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 0: invalid start byte
So, it's picking up UTF-8 from LANG(?) and applying it, but the data simply isn't valid UTF-8 and the decode fails. If I explicitly use "latin-1" I get something similar, but it isn't the same as the terminal either.
sys.stdout.write(data.decode('latin-1'))
Yields (notional example -- not actual text):
©7yIº*ø^Mÿ*Ig«áEIt±.Q ÈyT?æsÎ_%v1DÎú¹×,sÛÐûóÜun¢$&6YuApÁ¼pnòàJð
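(As a side note on why latin-1 never raises an error: it maps every byte value 0x00–0xFF to the Unicode code point with the same number, so any byte stream decodes successfully; it just may not render the way the terminal would. A quick sketch:)

```python
# latin-1 is a 1:1 byte-to-code-point mapping, so decoding can never fail,
# no matter how random the input bytes are.
data = bytes(range(256))        # every possible byte value
text = data.decode('latin-1')
print(len(text))                # 256 -- one character per byte
print(b'\xb1'.decode('latin-1'))  # the byte that made UTF-8 choke decodes fine
```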
So, the question is: how do I read the terminal settings so I can use them to decode and reproduce exactly what would have appeared on the terminal?
I've checked out these other questions: Why does Python print unicode characters when the default encoding is ASCII? and Convert bytes to a Python string. They cover some parts, but I'm not understanding the interaction with the shell/bash/etc.
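(For what it's worth, the encoding Python attaches to stdout can be inspected like this; on a tty it normally comes from the locale, i.e. LANG / LC_*, which is why the decode above defaulted to UTF-8:)

```python
import locale
import sys

# The locale's preferred encoding, derived from LANG / LC_* on Linux.
print(locale.getpreferredencoding())

# The encoding Python actually attached to the stdout text stream;
# when stdout is a terminal this is usually the same value.
print(sys.stdout.encoding)
```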
EDIT (Answer Evaluation):
Using errors="replace" gets me close. os.write duplicates the output of cat exactly. The remaining difference seems to be only in how the replacement characters are clumped together, so it isn't really important.
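(The os.write approach mentioned above looks roughly like this sketch: it hands the raw bytes straight to the file descriptor, bypassing Python's text layer entirely, so the terminal renders them exactly as cat would.)

```python
import os
import sys

# Illustrative bytes; in the real case this would be the data read from the pipe.
data = b'\xb1\x8f some raw bytes \xff\n'

# Write the bytes directly to stdout's file descriptor -- no decoding at all,
# so the terminal interprets the bytes itself, just like `cat`.
os.write(sys.stdout.fileno(), data)

# sys.stdout.buffer.write(data) is an equivalent way to skip the text layer.
```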
EDIT EDIT (Actual Solution): In the end, I apparently need to read the locale/encoding on the original machine and send it across, then interpret the bytes as that encoding, transcode to the local machine's encoding, and display the result. The project wasn't well thought through in this sense, as I'm not sent the remote information and I clearly need it. For the current build I ended up using this (it gets rid of the errors):
sys.stdout.write(data.decode('utf-8', errors='replace'))
sys.stdout.flush()
EDITs (cleared up the text of the question to focus on the actual topic)