Python UnicodeDecodeError - How to correctly read unicode strings from subprocess?

Question

I am having problems with subprocesses in Python which return unicode characters, especially the German ü, ä, ö characters.

My script opens a subprocess and reads the strings it returns with stdout.read(). Some of those strings may contain unicode characters, but it is not known in advance whether or where they occur. So the output has to be decoded (or encoded?) somehow to display the string correctly. Working with a bytes object is not an option for me.

The following code shows in short what I am trying to do, but it fails to decode the string and raises "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 12: invalid start byte":

import subprocess

command_array = ['echo', 'string_with_ü_ä_ö']
command = subprocess.Popen(command_array, stdout=subprocess.PIPE, shell=True)

command_output = command.stdout.read()
command_output = command_output.decode()
print(command_output)

I feel that there has to be some trivial solution to this, which I failed to find anywhere. Is there any way to correctly return those unicode characters in a string?

I am using Python 3.6.3, and the above script runs on Windows. A version which works under Linux as well will be equally appreciated!

Are you sure the encoding is utf-8 and not iso-8859-1 for example? — Daniel Roseman, Nov 13 '18 at 11:15
Note also, once you do find the correct encoding, you can pass it to Popen as the `encoding` argument and then it will be automatically decoded to str for you. — Daniel Roseman, Nov 13 '18 at 11:16
Passing in a list of tokens is fundamentally incompatible with `shell=True`, though it probably happens to work more or less by mistake on your platform. — tripleee, Nov 13 '18 at 11:31
Thanks for the replies! So it is not iso-8859-1, as it only returns empty spaces instead of the characters. Is there any way to find the correct encoding, or do I have to try manually? — Johannes M., Nov 13 '18 at 11:35
If the encoding of your Python script is something other than UTF-8, you obviously aren't asking the shell to `echo` UTF-8 characters. If you have everything set up for UTF-8, you should be fine. — tripleee, Nov 13 '18 at 11:36
Yes, the `shell=True` thing was a workaround for Windows. I don't use it with Linux. — Johannes M., Nov 13 '18 at 11:37

score 1 · Answer 1 · answered Nov 13 '18 at 11:34

With Python >= 3.6, you want subprocess.run() with universal_newlines=True:

import subprocess

command_array = ['echo', 'string_with_ü_ä_ö']
result = subprocess.run(command_array,
    stdout=subprocess.PIPE, universal_newlines=True)
print(result.stdout)

In Python 3.7, text was added as an alias for universal_newlines; the new name better explains what the option actually does. The old name still works.
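
A comment on the question points out that Popen accepts an `encoding` argument, and subprocess.run() accepts it as well (since Python 3.6). The following is a minimal sketch rather than a definitive fix: the helper name run_text and the stand-in child command are made up for illustration, and "utf-8" is only correct if the real program actually emits UTF-8.

```python
import subprocess
import sys


def run_text(cmd, encoding):
    """Run cmd and return its stdout decoded with the given encoding."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        encoding=encoding,  # implies text mode, so stdout is a str
        check=True,         # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


# Hypothetical stand-in for the real command: a child Python process that
# writes UTF-8 bytes directly, so the console code page plays no part.
# The \x escapes keep the command line itself pure ASCII.
CHILD = r"import sys; sys.stdout.buffer.write('string_with_\xfc_\xe4_\xf6\n'.encode('utf-8'))"

output = run_text([sys.executable, "-c", CHILD], encoding="utf-8")
```

If the program emits a console code page instead, the same call with that encoding name should apply. With a wrong guess, the call raises UnicodeDecodeError just as in the question.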

tripleee

For (much) more on what all of this means, see also https://stackoverflow.com/a/51950538/874188 – tripleee Nov 13 '18 at 11:34
Maybe your Windows code page is something odd? https://stackoverflow.com/questions/31469707/changing-the-locale-preferred-encoding-in-python-3-in-windows – tripleee Nov 13 '18 at 11:37
I tried that (with the addition of `shell=True`) and it gives me the following Error: "UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 12: character maps to <undefined>" – Johannes M. Nov 13 '18 at 11:54
`shell=True` is decidedly and utterly wrong here, as already detailed in a previous comment. Can you figure out what character set your Windows is producing, and what encoding the Python script is in, as already asked above? – tripleee Nov 13 '18 at 12:19
Oh, if it's just to get `echo` because you don't have an external tool with that name, replace the `echo` with something else. I imagine it's just a placeholder for something much more complex. For the sake of argument, try `['python', '-c', 'print("Hällö")']` instead of adding the wickedness that is the Windows command shell. – tripleee Nov 13 '18 at 12:24
That actually works in this case. The `echo` command, however, was a placeholder for a different command I am using. In the specific case I need it for, I call a different program with various parameters which lists filenames and IDs, which may or may not contain ö, ä, ü and the likes. Using your approach with this, I now get "UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 11: ordinal not in range(128)" – Johannes M. Nov 13 '18 at 15:33

score 1 · Accepted Answer · answered Nov 13 '18 at 12:24

I have found by trial and error that decoding with cp850 works and yields the expected output:

import subprocess

command_array = ['echo', 'string_with_ü_ä_ö']
command = subprocess.Popen(command_array, stdout=subprocess.PIPE, shell=True)

command_output = command.stdout.read()
command_output = command_output.decode('cp850')
print(command_output)

If you save the above code as a UTF-8 encoded file (the default source encoding for Python 3 regardless of the platform) and run it with python3, it prints:

string_with_ü_ä_ö

Unfortunately, I don't know where or why this particular encoding is chosen, so this might not work with different setups, but at least I am confident it will work with yours.
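
The error messages in this thread are consistent with cp850: in that code page, byte 0x81 is ü, and position 12 is exactly where the ü sits in the test string. The sketch below checks this on the raw bytes and then shows one way to ask the system for a likely encoding instead of guessing by hand. The helper names try_decode and guess_console_encoding are hypothetical, and the idea that the real command writes in the console (OEM) code page is an assumption: many Windows console tools do, but not all.

```python
import locale
import sys

# Bytes matching the error message: 0x81 at position 12.
# 0x84 and 0x94 are the cp850 codes for ä and ö.
RAW = b"string_with_\x81_\x84_\x94"


def try_decode(data, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        return "{}: {}".format(type(err).__name__, err)


def guess_console_encoding():
    """Best-effort guess at the encoding console programs write in."""
    if sys.platform == "win32":
        import ctypes
        kernel32 = ctypes.windll.kernel32
        # GetConsoleOutputCP returns 0 when no console is attached,
        # in which case fall back to the system OEM code page.
        codepage = kernel32.GetConsoleOutputCP() or kernel32.GetOEMCP()
        return "cp{}".format(codepage)
    # Elsewhere, the locale's preferred encoding is the usual choice.
    return locale.getpreferredencoding(False)


# ascii() keeps the printout safe whatever the terminal encoding is.
for name in ("utf-8", "cp1252", "cp850"):
    print(name, "->", ascii(try_decode(RAW, name)))
print("guess:", guess_console_encoding())
```

The value returned by guess_console_encoding() could then be passed to decode() or to the `encoding` argument of Popen in place of the hard-coded 'cp850'. It remains a guess, since a program is free to write its output in any encoding.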

Stop harming Monica

In the general case, there is no guarantee or expectation that this particular code page is the right one for your system. See (again) https://stackoverflow.com/questions/31469707/changing-the-locale-preferred-encoding-in-python-3-in-windows – tripleee Nov 13 '18 at 17:05
@tripleee All I know is that on my box the output comes encoded with that particular encoding (or a similar one) in this and other similar cases. The error message tells me that it is the same for the OP. As I said, I have no idea where this encoding comes from or what it depends on. I don't have my Windows box handy, but I am pretty sure this is not the encoding used by Python's `open`, which is some variant of `latin`. – Stop harming Monica Nov 13 '18 at 18:14
this is amazing. i was trying to read the git log from https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.14.y&id=d354d9afe923eb08f7ee89128b38ddb6415de01d and it choked due to the character in the return from git. – Piotr Sep 23 '22 at 16:45
