Python: How to decode file names retrieved from 'dir' command using subprocess?

Question

I am trying to get a directory listing on a Windows 10 file system using the subprocess.Popen function and the dir command in Python 3.8.2. To be more specific, I have this piece of code:

import subprocess

process = subprocess.Popen(['dir'], shell = True, stdout = subprocess.PIPE, stderr = subprocess.STDOUT)
for line in iter(process.stdout.readline, b''):
  print(line.decode('utf-16'))
process.stdout.close()

When I run the above in a directory that has file names with Unicode characters (such as "háčky a čárky.txt"), I get the following error:

Traceback (most recent call last):
  File "error.py", line 5, in <module>
    print(line.decode('utf-16'))
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x0a in position 42: truncated data

Obviously, the problem is with the encoding. I have tried using 'utf-8' instead of 'utf-16', but with no success. When I remove the decode('utf-16') call and use just print(line), I get the following output:

b' Volume in drive C is OSDisk\r\n'
b' Volume Serial Number is 9E2B-67E3\r\n'
b'\r\n'
b' Directory of C:\\Users\\asamec\\Dropbox\\DIY\\Python\\AccessibleRunner\\AccessibleRunner\r\n'
b'\r\n'
b'05/14/2021  09:19 AM    <DIR>          .\r\n'
b'05/14/2021  09:19 AM    <DIR>          ..\r\n'
b'05/13/2021  09:46 PM             5,697 AccessibleRunner.py\r\n'
b'05/14/2021  09:18 AM               214 error.py\r\n'
b'05/13/2021  05:48 PM             5,642 h\xa0cky a c\xa0rky.txt.py\r\n'
b'               3 File(s)         11,553 bytes\r\n'
b'               2 Dir(s)  230,706,778,112 bytes free\r\n'

When I remove the 'utf-16' argument and leave just print(line.decode()), I get the following error:

Traceback (most recent call last):
  File "error.py", line 5, in <module>
    print(line.decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 40: invalid start byte

So the question is: how should I decode the process's standard output so that I can print the correct characters?

Update

Running the chcp 65001 command in the Windows command line before running the Python script solves the problem. However, the following gives me the same error as above:

import subprocess

process = subprocess.Popen(['cmd', '/c', 'chcp 65001 & dir'], shell = True, stdout = subprocess.PIPE, stderr = subprocess.STDOUT)
for line in iter(process.stdout.readline, b''):
  print(line.decode('utf-16'))
process.stdout.close()

However, when running this same Python script for the second time, it starts to work as the code page is already set to 65001. So the question now is how can I set the Windows console code page not prior to running the Python script, but rather in that Python script?

There are plenty more [direct ways](https://stackoverflow.com/questions/2759323/how-can-i-list-the-contents-of-a-directory-in-python) to get the contents of a directory than trying to parse the `stdout` of `dir` - why mess around with the funny edge cases of this method? — esqew, May 13 '21 at 18:14
I am building a simple command line and `dir` is just an example of a command that could be run in that tool. — Adam, May 13 '21 at 18:50
What if you use `print(line) ### .decode('utf-16'))`? Please include that info for `"háčky a čárky.txt"` to your [mcve]. For me it's UTF-8 `b'h\xc3\xa1\xc4\x8dky a \xc4\x8d\xc3\xa1rky.txt\r\n'` because my `REG QUERY "HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage" -v *CP` returns `65001` in `ACP` as well as `OEMCP`; yours could be different… `print(line.decode())` should work. — JosefZ, May 13 '21 at 21:05
@JosefZ I have updated the question to address your suggestions. — Adam, May 14 '21 at 08:08
Have you set the [`PYTHONIOENCODING`](https://docs.python.org/3/using/cmdline.html#envvar-PYTHONIOENCODING) environment variable? Mine is `PYTHONIOENCODING=utf-8`. — JosefZ, May 14 '21 at 11:18
Setting the env. var. using `os.environ['PYTHONIOENCODING'] = 'utf-8'` gives me error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 40: invalid start byte` — Adam, May 14 '21 at 14:13

score 0 · Answer 1 · answered May 14 '21 at 10:46

Set console to UTF-8 before running the script (use CHCP 65001):

The script runs smoothly then: .\SO\67524114.py

Active code page: 65001
HL~Real~Def.txt
html.txt
háčky a čárky.txt

I can reproduce the issue using the following call:

>NUL chcp 852
.\SO\67524114.py

Active code page: 852
HL~Real~Def.txt
html.txt
Traceback (most recent call last):
  File "D:\bat\SO\67524114.py", line 7, in <module>
    print(line.decode('utf-8').strip())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 1: invalid start byte

Modified script used for testing:

import subprocess

process = subprocess.Popen(['cmd', '/c', 'chcp&dir /B h*.txt'], shell = True, stdout = subprocess.PIPE, stderr = subprocess.STDOUT)
for line in iter(process.stdout.readline, b''):
  print(line.decode('utf-8').strip())

process.stdout.close()
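
The two outcomes above come down to how the same bytes read under different codecs. Here is a minimal sketch using the byte string printed in the question. The choice of cp437 is an assumption: the asker's active code page is not stated, and cp437 is only one of several OEM pages in which 0xa0 is "á".

```python
# Bytes as printed in the question for the file with "háčky a čárky" in its name.
raw = b"h\xa0cky a c\xa0rky.txt.py"

# 0xa0 is a UTF-8 continuation byte and cannot start a character,
# so decoding as UTF-8 fails, as in the tracebacks.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumption: an OEM code page such as cp437, where 0xa0 is "á".
# The bytes already contain a plain "c" where the name has "č".
as_cp437 = raw.decode("cp437")

# Code page 852 contains both "á" and "č", so the full name survives a round trip.
name = "háčky a čárky.txt"
round_trip = name.encode("cp852").decode("cp852")

print(utf8_ok, as_cp437, round_trip == name)
```

This only shows why the decode step fails. Which code page the console really uses still has to be checked with chcp.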

Thanks, this is almost the solution, however not exactly. Please, see the updated question. — Adam, May 14 '21 at 18:40

score 0 · Accepted Answer · answered May 14 '21 at 22:36

As @JosefZ suggested in his answer, the UTF-8 code page must be set in the Windows command line prior to running the dir command. Below is the complete solution for my question:

import subprocess

subprocess.call(['chcp', '65001'], shell = True)
process = subprocess.Popen(['dir'], shell = True, stdout = subprocess.PIPE, stderr = subprocess.STDOUT)
for line in iter(process.stdout.readline, b''):
  print(line.decode('utf-8'))
process.stdout.close()
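
A possible variation on the code above, offered as a sketch rather than a tested replacement: it keeps the separate chcp call, uses subprocess.run() instead of Popen, and decodes with errors="replace" so that a stray byte does not abort the listing. The helper names decode_lines and run_in_utf8_console are hypothetical, and the Windows-only part is guarded so the file can still be imported elsewhere.

```python
import subprocess
import sys

def decode_lines(raw, encoding="utf-8"):
    """Decode captured output and split it into lines without line endings."""
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    return raw.decode(encoding, errors="replace").splitlines()

def run_in_utf8_console(command):
    """Switch the console to code page 65001, then run `command` via cmd.exe."""
    # Same idea as the chcp call above; >NUL hides the "Active code page" banner.
    subprocess.call("chcp 65001 >NUL", shell=True)
    completed = subprocess.run(command, shell=True,
                               stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    return decode_lines(completed.stdout)

if sys.platform == "win32":  # chcp and dir are cmd.exe commands
    for line in run_in_utf8_console("dir"):
        print(line)
```

As the question notes, the code page change stays in effect in that console after the script exits.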

Brainor · Answer 3 · 2022-03-01T08:20:24.457

Since Python 3.6 (released in December 2016), subprocess.run() accepts an encoding parameter, so you can specify the encoding used to decode the output.

So, if you don't want to change the encoding of the CMD:

1. Type chcp in your CMD to get the active code page, e.g. 936.
2. Look up the encoding in Code Page Identifiers, e.g. Identifier 936 has the .NET Name gb2312.
3. gb2312 is the encoding name Python recognizes in most cases, but you can check the Standard Encodings of Python 3.10 to be sure (thanks to Mark Amery).
4. Add encoding='gb2312' to your subprocess.run() call:

process_list = subprocess.run('dir', shell = True, stdout = subprocess.PIPE, stderr = subprocess.STDOUT, text=True, encoding='gb2312').stdout.split('\n')[:-1]

The subprocess.Popen constructor also has an encoding parameter if you really want to stick to Popen, although the documentation states: "The recommended approach to invoking subprocesses is to use the run() function for all use cases it can handle."

If you want to change the encoding of the CMD, refer to the answer by JosefZ.
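
For the route that leaves the CMD code page unchanged, the lookup can be partly automated. This is a minimal sketch under two assumptions: that chcp prints the code page number somewhere in its output, and that Python names the matching codec "cp" plus that number. That naming holds for many code pages but not all, so the sketch checks the name with codecs.lookup(). The function name codec_from_chcp is hypothetical.

```python
import codecs
import re

def codec_from_chcp(chcp_output):
    """Map the text printed by `chcp` to a Python codec name."""
    match = re.search(r"(\d+)", chcp_output)
    if match is None:
        raise ValueError(f"no code page number in {chcp_output!r}")
    code_page = int(match.group(1))
    # 65001 is the Windows identifier for UTF-8.
    name = "utf-8" if code_page == 65001 else f"cp{code_page}"
    codecs.lookup(name)  # raises LookupError if Python has no such codec
    return name

print(codec_from_chcp("Active code page: 852"))
print(codec_from_chcp("Active code page: 936"))
print(codec_from_chcp("Active code page: 65001"))
```

The returned name can then be passed as the encoding argument of subprocess.run().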
