
I'm currently working on a project where I need to run a command in PowerShell, and part of the output is not in English (specifically, Hebrew).

For example (a simplified version of the problem), if I want to get the content of my desktop, and there is a filename in Hebrew:

import subprocess
command = "powershell.exe ls ~/Desktop"
print (subprocess.run(command.split(), stdout=subprocess.PIPE).stdout.decode())

This code raises the following error (or something similar with a different byte value):

UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 19: invalid start byte

I tried running it on a different computer, and this was the output:

?????

Any idea why that is, and how I can fix it? I tried a lot of things I saw in other questions, but none of them worked for me.

davidalk
  • Try to use `decode()` with the `encoding` parameter, for example `decode(encoding="latin1")` – Funpy97 May 31 '21 at 13:01
  • Output character encoding is dependent on your system/OS/shell settings. If you get the UnicodeDecodeError, it means that the output captured is *NOT* valid UTF-8. You might be able to fetch the encoding with `locale.getpreferredencoding()` and use that as the parameter to `decode()`, as @Marino pointed out above. – rasjani May 31 '21 at 13:12
  • @Marino Latin-1 doesn't support Hebrew. Decoding will succeed (because *any* byte sequence can be decoded with Latin-1), but the result probably will be garbage. – lenz May 31 '21 at 13:12
  • Thank you for your comments. Unfortunately - none of them worked :( The command output in python I think is literally the char `?`, not really sure why. – davidalk May 31 '21 at 13:40
  • Can you give some example file names that you are having issues with? – HAL9256 May 31 '21 at 14:25
  • `קובץראשון.txt` for example – davidalk May 31 '21 at 14:45
  • `b'\x96'.decode('cp862')` returns `'צ'` (_Hebrew Letter Tsadi_). Please share raw bytes `print (subprocess.run(command.split(), stdout=subprocess.PIPE).stdout)`. – JosefZ May 31 '21 at 16:40

1 Answer


Note: The following Python 3+ solutions work in principle, however:

  • Due to a bug in powershell.exe, the Windows PowerShell CLI, the current console window switches to a raster font (potentially with a different font size), which cannot display most Unicode characters outside the extended-ASCII range. While visually jarring, this is merely a display (rendering) problem; the data is handled correctly, and switching back to a Unicode-aware font such as Consolas reveals the correct output.

  • By contrast, pwsh.exe, the PowerShell (Core) (v6+) CLI, does not exhibit this problem.


Option A: Configure both the console and Python to use UTF-8 character encoding before executing your script:

  • Configure the console to use UTF-8:

    • From cmd.exe, by switching the active OEM code page to 65001 (UTF-8); note that this change potentially affects all later calls to console applications in the session, independently of Python, unless you restore the original code page (see Option B below):

      chcp 65001
      
    • From PowerShell:

      $OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
      
  • Configure Python (v3.7+) to use UTF-8 consistently:[1]

    • Set environment variable PYTHONUTF8 to 1, possibly persistently, via the registry; to do it ad hoc:

      • From cmd.exe:

        Set PYTHONUTF8=1
        
      • From PowerShell:

        $env:PYTHONUTF8=1
        
    • Alternatively, for an individual call (v3.7+): Pass command-line option -X utf8 to the python interpreter (note: case matters):

        python -X utf8 somefile.py ...
      

Now, your original code should work as-is (except for the display bug).
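
To check that the Python side of Option A actually took effect, a small diagnostic like the following can help. It is only a sketch: the function name `describe_python_encodings` is made up for illustration, and it reports settings rather than changing them.

```python
import locale
import sys


def describe_python_encodings():
    """Collect the encoding settings that PYTHONUTF8=1 / -X utf8 influence."""
    return {
        # 1 when UTF-8 mode is in effect, 0 otherwise (Python 3.7+)
        "utf8_mode": sys.flags.utf8_mode,
        # Encoding Python uses by default for text files and pipes
        "preferred_encoding": locale.getpreferredencoding(False),
        # Encoding of the standard output stream (None if there is no stdout)
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in describe_python_encodings().items():
        print(f"{name}: {value}")
```

Running it once with and once without `-X utf8` shows whether the switch reached the interpreter.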


Option B: Temporarily switch to UTF-8 for the PowerShell call:

import sys, ctypes, subprocess

# Switch Python's own standard streams to UTF-8, if necessary (Python 3.7+).
# For the standard streams, this is the in-script equivalent of setting
# environment var. PYTHONUTF8 to 1 *before* calling the script.
for stream in (sys.stdin, sys.stdout, sys.stderr):
    stream.reconfigure(encoding='utf-8')

# Enclose the PowerShell call in `chcp` calls:
#   * Change to the UTF-8 code page (65001), 
#   * Execute the PowerShell command (which then outputs UTF-8)
#   * Restore the original OEM code page.
command = "chcp 65001 >NUL & powershell ls ~/Desktop & chcp " + str(ctypes.windll.kernel32.GetConsoleOutputCP()) + ' >NUL'

# Note: 
#  * `shell=True` ensures that the command is invoked via cmd.exe, which is
#     required, now that we're calling *multiple* executables and use output
#     redirections (`>NUL`)
#  * The command is passed as a single string, so cmd.exe receives it verbatim.
print(subprocess.run(command, stdout=subprocess.PIPE, shell=True).stdout.decode('utf-8'))
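
A variation on the same idea, as a sketch rather than a tested drop-in: instead of wrapping the call in `chcp`, the PowerShell command itself can set `[Console]::OutputEncoding` before running, which avoids `shell=True`. The helper names `build_powershell_args` and `run_powershell_utf8` are hypothetical. Unlike the `chcp` wrapper above, this sketch does not restore the previous code page, so the console may stay on UTF-8 afterwards.

```python
import subprocess
import sys

# Assumption: prefixing the command with this statement makes PowerShell
# emit UTF-8 on stdout for this one call.
UTF8_PREFIX = "[Console]::OutputEncoding = [System.Text.UTF8Encoding]::new(); "


def build_powershell_args(ps_command, exe="powershell.exe"):
    """Return the argument list for a PowerShell call that outputs UTF-8."""
    return [exe, "-NoProfile", "-Command", UTF8_PREFIX + ps_command]


def run_powershell_utf8(ps_command):
    """Run the command and decode its captured output as UTF-8."""
    completed = subprocess.run(
        build_powershell_args(ps_command),
        stdout=subprocess.PIPE,
        check=True,
    )
    return completed.stdout.decode("utf-8")


# Only attempt the real call on Windows, where powershell.exe exists.
if __name__ == "__main__" and sys.platform == "win32":
    print(run_powershell_utf8("ls ~/Desktop"))
```

Passing `exe="pwsh.exe"` targets PowerShell (Core) instead.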

[1] This isn't strictly necessary just for correctly decoding PowerShell's output, but matters if you want to pass that output on from Python: Python 3.x defaults to the active ANSI(!) code page for encoding non-console output, which means that characters outside that code page (Hebrew characters on a system whose ANSI code page is Windows-1252, for instance) cannot be represented in non-console output (e.g., when redirecting to a file); attempting to write them raises a UnicodeEncodeError and breaks the script.
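
The encoding limitation in the footnote can be reproduced without PowerShell. The sketch below uses the file name from the question's comments and tries a few code pages; which of them is the "active ANSI code page" on a given machine is an assumption, and `can_encode` is a helper invented for this illustration.

```python
HEBREW_NAME = "קובץראשון.txt"


def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    # cp1252: Western European ANSI; cp1255: Hebrew ANSI; cp862: Hebrew OEM
    for enc in ("cp1252", "cp1255", "cp862", "utf-8"):
        print(enc, can_encode(HEBREW_NAME, enc))
```

Only the code pages that contain Hebrew letters, plus UTF-8, can represent the name; with UTF-8 the result no longer depends on the machine's regional settings.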

mklement0