
I have a Python project where I have to display data piped to me from another machine with an unknown encoding. I'm running Python 3 on an Ubuntu VM. What I get is a stream of bytes (perhaps command output, a cat'd file, or similar). I just need to display the data, regardless of source, as best I can.

As a test, I'm trying to cat /dev/urandom and have it displayed the same way via Python as it would be if I typed the command myself. To make it reproducible, I've used head -n 2 /dev/urandom instead of the endless stream you get with cat.

In bash, when I cat the file, I get the standard random gunk. I have LANG=en_US.UTF-8. Lots of characters don't really render (a sort of diamond with a question mark in it) or come out blank (because it's obviously not UTF-8; it's just random bytes of raw data):

eJ̘��}��jf��)���N�n��t��8=����X-�L�^t�M����Z���g�8#K T��c��z�ZO+�ϩD1{|EX
��)'���ei۝{W�r��畴��Ii�Y���
                        �}���+��;-�i-
                                     S��Az
                                          uV�1XBxFZ3+4��G�*��Q�+!  

However, if I read the file in Python and print to standard out, I get encoding errors unless I use 'latin-1'. I've even tried using the encoding from the default stream, assuming it inherited from the terminal. This clearly isn't right, as I would need the encoding from the stream on the remote end of the pipe (which I don't have).

>>> f = open("foo.txt", "rb")
>>> data = f.read(530)
>>> import sys
>>> sys.stdout.write(data.decode(sys.stdout.encoding))

As expected, the error produced reflects the random data not following the UTF-8 encoding standard:

>>> sys.stdout.write(data.decode(sys.stdout.encoding))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 0: invalid start byte

So, it's picking up the UTF-8 from LANG(?) and applying it, but it simply hits an encoding error and fails. If I explicitly use "latin-1" I get something similar, but it isn't the same as the terminal output either.

sys.stdout.write(data.decode('latin-1'))

Yields (notional example -- not actual text):

©7yIº*ø^Mÿ*Ig«áEIt±.Q   ÈyT?æsÎ_%v1DÎú¹×,sÛÐûóÜun¢$&6YuApÁ¼­pnòàJð

So, the question is... how do I read the terminal settings so I can use them to decode and reproduce exactly what would have appeared on the terminal?

I've checked out these other questions: Why does Python print unicode characters when the default encoding is ASCII? and Convert bytes to a Python string. They cover some parts, but I'm not understanding the interaction with the shell/bash/etc.

EDIT (Answer Evaluation):

Using errors="replace" gets me close, and os.write duplicates the output of cat. The remaining difference seems to be in how the error characters are clumped together, so it isn't really important.

EDIT EDIT (Actual Solution): In the end, I apparently need to read the locale/encoding from the original machine so I can send it across, then interpret the bytes as that encoding, transcode to the local machine's encoding, and then display. The project wasn't well thought through in this sense, as I am not sent the remote information and I clearly need it. I ended up using this for the current build, as it gets rid of the errors:

sys.stdout.write(data.decode('utf-8', errors='replace'))
sys.stdout.flush()
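
For the eventual fix described above, the pipeline would look roughly like the sketch below. This is only a sketch: remote_encoding is a placeholder for whatever the sending machine would report, since that information isn't actually available in the current build.

import sys

# Assumption: the sender transmits its locale's encoding with the data.
remote_encoding = 'latin-1'   # hypothetical value reported by the sender
data = b'\xb1\x7f\xc3\xa9'    # example bytes received over the pipe

# Decode using the sender's encoding (substituting U+FFFD for bad bytes),
# then let Python re-encode for the local terminal on write.
sys.stdout.write(data.decode(remote_encoding, errors='replace'))
sys.stdout.flush()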

EDITs (cleared up text of question to focus on actual topic)

LawfulEvil
  • `/dev/urandom` is giving you random bytes that usually don't make up valid code points/printable characters. – Jasper Mar 22 '16 at 20:35
  • also http://stackoverflow.com/questions/6396659/how-do-you-get-the-encoding-of-the-terminal-from-within-a-python-script – Jasper Mar 22 '16 at 20:49
  • I know what urandom is giving: random bytes. I know they don't make up valid code points (hence the expected UTF-8 error). As you can see, I tried exactly what was suggested in your link. It does NOT answer the question. It claims to, but from the trivial test you can see it doesn't. What encoding produces a duplication of the bash shell behaviour? – LawfulEvil Mar 22 '16 at 20:56
  • @LawfulEvil 1- you don't need to know the "terminal encoding" (whatever it is) to print bytes. You could use `sys.stdout.buffer` or any other byte interface such as `os.write()` (a sketch of this follows these comments). 2- you don't need to know the "terminal encoding" to print text. Just print Unicode directly—Python 3 uses `locale.getpreferredencoding(False)` by default. The user should configure the locale to accept the corresponding characters (`LC_ALL`, `LC_CTYPE`, `LANG`) or set `PYTHONIOENCODING`. If the terminal application accepts UTF-8, then any UTF-8 locale such as `en_US.UTF-8` will do. – jfs Mar 22 '16 at 22:52
  • Here's [`cat(1)` Python 3 implementation](https://gist.github.com/zed/cda879d141081e5764bd). Here's possibly [more efficient `os.sendfile()`-based solution for Linux](http://stackoverflow.com/a/14411471/4279) – jfs Mar 22 '16 at 23:06
  • The actual setup is harder. The file is coming over an encrypted connection and is simply bytes. The local machine doesn't know anything about the file type, contents, etc and simply should display the contents using whatever settings the operator specified when setting up the terminal. As you can see from the edit above, using replace errors fixed a lot of the problem, but it still isn't the same. – LawfulEvil Mar 23 '16 at 13:00
  • Decoding with "replace" is bound to cause data differences - it replaces byte sequences that are not UTF-8! Only writing directly to stdout will mimic `cat` (as per @thatotherguy's answer). Then the user's locale plays no part and is only dependent on the user's **terminal encoding**. Piping binary data to the shell is only going to end in misery and you'll risk hitting a tty escape sequence. If you're not piping binary data, then your tests above are invalid. Further, you can only re-encode text data to the user's locale if you know the encoding of the incoming bytes. – Alastair McCormack Mar 23 '16 at 18:52
  • At this rate, we're at risk of a non-reproducible situation, bogged down in an XY problem :/ – Alastair McCormack Mar 23 '16 at 18:53
  • Sadly, I'm not at liberty to explain the system. It makes it hard to seek help, and harder still on those generous souls attempting to help. For some reason the architects of the system had concerns about using os.write and wanted to continue using stdout even if it meant some small differences in unprintable characters. They acknowledged that what we need to do is get the locale/encoding from the original source and apply that. So, you were right @AlastairMcCormack, well done and thanks. – LawfulEvil Mar 23 '16 at 19:07
  • ok mate. Good luck! :) – Alastair McCormack Mar 23 '16 at 19:10
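
Following up on jfs's comment above, here is a minimal sketch of the byte-interface approach. The sample bytes are made up; the point is that nothing is decoded or encoded on the way out.

import sys

data = b'\xb1\x00\xffsome raw bytes'  # stand-in for the piped input

# sys.stdout.buffer is the binary layer beneath the text stream, so the
# bytes pass through untouched; no terminal encoding is consulted.
sys.stdout.buffer.write(data)
sys.stdout.buffer.flush()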

4 Answers


Addressing the problem and not the question given (https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem):

I have a python project which is getting sent to various users on various linux flavors (vms, real machines, etc) all with python 3. As part of this program, the python may display one of their local files to the terminal screen. I'm looking for a way to read the settings of the terminal to use them for displaying the data. In this way, it will be up to user to set their terminal to handle extended character sets or whatever if that is what they want.

You have already discovered that Python uses the locale to resolve the likely encoding of the terminal: if you write a Python 3 str to stdout, using write() or print(), you will find that it's automatically encoded for the terminal. That means your code does not need to (and should not have to) detect its environment.

I said "likely encoding" because decoding characters is not the responsibility of the shell but of the terminal application (Terminal/iTerm/PuTTY etc.), which may be running on the user's desktop remotely. You will have to hope that most people have left the locale and terminal encoding at the default, which is now, thankfully, usually UTF-8.

If you're going to open a text file on their local machine, you should also apply an encoding when reading that file (you could write it straight out without decoding/encoding, but then you wouldn't be able to manipulate it as cleanly). Thankfully, open() will also use locale.getpreferredencoding(), sourced from the user's locale, to set the encoding used to decode the file.

That means, if a user's LANG='en_GB.UTF-8', then locale.getpreferredencoding() == "UTF-8". If you open a file with open('file.txt', 'r'), that file will be decoded as UTF-8 into a Python str.
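
A quick illustration of that behaviour (the file name is a placeholder, and the printed value depends on the user's locale):

import locale

# Reports the locale-derived default, e.g. 'UTF-8' under en_GB.UTF-8
print(locale.getpreferredencoding())

# Text-mode open() uses the same default encoding
with open('file.txt', 'r') as f:   # placeholder path
    text = f.read()                # bytes decoded to str via the locale
print(text)                        # str re-encoded for the terminal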

You will be able to write/print this data to the console, confident that you've done what you can to work around the encoding legacy we have to live with.

Alastair McCormack
  • In my system, the file is opened by some C program and sent as bytes to another machine, so even if I read it as a string, the machine reading the file doesn't know the settings of the machine printing it, so it couldn't possibly convert the bytes correctly. – LawfulEvil Mar 23 '16 at 13:18
  • So, why didn't you say that in your question? What is the actual problem? – Alastair McCormack Mar 23 '16 at 18:54
  • Because I was asking about Y, not X. Obviously. I was focused on the piece of code in question, which was attempting to take unknown bytes and print them as UTF-8. If the source file was not encoded that way, it doesn't work. In the end, I have to do what you were saying, but that's a bigger architectural change to the system, something for the next build. I was focused on fixing a test case (which used /dev/urandom data), not on the systemic problem. – LawfulEvil Mar 23 '16 at 19:11

If you want to reproduce what cat would do, then just write the data to stdout without decoding it:

import os

# Read raw bytes and write them straight to file descriptor 1 (stdout),
# bypassing Python's text layer so no decoding/encoding is attempted.
f = open("/dev/urandom", "rb")
data = f.read(1000)
os.write(1, data)
that other guy
  • The code I am fixing had a call to sys.stdout.flush() right after the broken sys.stdout.write call. Do I need to do something similar with os.write? Can I do os.write(1, data) and then sys.stdout.flush()? Does bypassing sys.stdout get it out of sync with what os.write did? – LawfulEvil Mar 23 '16 at 13:29
  • `os.write` does not buffer and therefore does not need flushing. It bypasses any buffering that `sys.stdout` would do as well, so you'd want to flush that before using `os.write` to preserve order. – that other guy Mar 23 '16 at 16:48
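
To make that ordering concrete, a small sketch (the byte values are arbitrary):

import os
import sys

sys.stdout.write("buffered text first")
sys.stdout.flush()            # drain sys.stdout's buffer...
os.write(1, b"\xb1\xff\n")    # ...before unbuffered bytes hit fd 1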

sys.stdout (and sys.stdin and sys.stderr) are text files. They have an encoding associated with them, which you can read from their aptly named encoding attributes, and they expect to work with strings, which they will encode or decode automatically, depending on I/O direction. On Linux, you should expect the encoding to be chosen based on environment variables of the Python process's initial environment. I am unaware of any mechanism to change the encoding of an open file, but in this case the encoding should indeed be the same as the one the terminal expects.
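
For example, the encodings are visible directly on the streams (the values shown depend on the locale):

import sys

# Each standard text stream reports the encoding it uses for its
# automatic str/bytes conversion.
print(sys.stdout.encoding)   # e.g. 'UTF-8' under an en_US.UTF-8 locale
print(sys.stdin.encoding)
print(sys.stderr.encoding)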

As thatotherguy wrote, you can perform the equivalent of a cat command by reading from the source file in binary mode, and using a low-level os.write() to send the bytes to the file descriptor for stdout. Note, however, that the underlying system function does not necessarily always write the full number of bytes specified, so that you must generally call it in a loop to ensure that all the desired bytes are pushed out. The Python docs don't specify, but inasmuch as the method has an equivalent interface to that of the underlying system call, one might best assume that it has the same semantics, too.
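
A sketch of such a loop (write_all is my own helper name, not a standard function):

import os

def write_all(fd, data):
    # os.write() may write fewer bytes than requested, so retry with
    # the remainder until everything has gone out.
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]

with open("/dev/urandom", "rb") as f:
    write_all(1, f.read(1000))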

But really, that's all going about things the wrong way. If you force a raw dump of the file's bytes, then you don't just force users to configure their environment to support extended character sets; you force them to configure it to support the exact character set in which the file is encoded (or else to endure gibberish).

A better solution would be to open the file as a text file, specifying the correct encoding (which presumably you know), to read in the data via the file object, so that Python decodes it correctly, and to write the resulting string(s) to stdout, whereby they will be correctly encoded for the terminal, at least inasmuch as that is possible. That way you can accommodate any character encoding at the terminal that supports all the characters in the file -- it doesn't have to actually be the same as the file's.
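
A minimal sketch of that approach, assuming the file's encoding is known (the path and latin-1 are placeholders):

# Decode with the known source encoding, then let the text stream
# re-encode each line for whatever the terminal's locale specifies.
with open("foo.txt", "r", encoding="latin-1") as f:
    for line in f:
        print(line, end="")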

John Bollinger

You are almost there; just pass errors='replace' to the decoding function.

import sys

with open('/dev/urandom', 'rb') as f:
    x = f.read(100)

print(x.decode(sys.stdout.encoding, errors='replace'))

�{�ʛf$��s���<�w'`�i6�/��Z�ʫ;����ek|%�-+����V�U��;w>פ���TV��
�}���639

sys.stdout.write(x.decode(sys.stdout.encoding, errors='replace'))

�{�ʛf$��s���<�w'`�i6�/��Z�ʫ;����ek|%�-+����V�U��;w>פ���TV��
�}���639

os.write(1, x)

�{�ʛf$��s���<�w'`�i6�/���Z�ʫ;����ek|%�-+����V�U��;w>פ����TV��
�}���639
Stop harming Monica
  • Did you notice that your three outputs aren't the same? The os.write one has /���Z while the print and stdout ones have /��Z – LawfulEvil Mar 23 '16 at 13:15
  • No, I didn't! Weird, I made a lot of tests and the three outputs were always the same. But it looks like you found otherwise in your own ones... – Stop harming Monica Mar 23 '16 at 14:34