Very beginner question. When I run the script from LP3THW Ex. 23, PowerShell doesn't display the non-English characters. I assume it has to do with UTF-16 / UTF-8 encoding, but I can't figure it out from other posts on Stack Overflow.

Here is the script:

import sys
script, input_encoding, error = sys.argv


def main(language_file, encoding, errors):
    line = language_file.readline()
    
    if line:
        print_line(line, encoding, errors)
        return main(language_file, encoding, errors)
        
        
def print_line(line, encoding, errors):
    next_lang = line.strip()
    raw_bytes = next_lang.encode(encoding, errors=errors)
    cooked_string = raw_bytes.decode(encoding, errors=errors)
    
    print(raw_bytes, "<===>", cooked_string)
    
    
languages = open("languages.txt", encoding="utf-8")

main(languages, input_encoding, error)

The contents of the text file (languages.txt) can be seen here: https://learnpythonthehardway.org/python3/languages.txt

[Screenshot of the PowerShell terminal when running the script]

Links to other posts, which left me even more confused:

Changing PowerShell's default output encoding to UTF-8

UTF8 Script in PowerShell outputs incorrect characters

mklement0

1 Answer

There are several problems:

  • Unicode character rendering:

    • The default font in regular console windows is limited in terms of the Unicode characters it can display, and many of those present in your sample file are not supported.

    • While you can try to switch to a different font that (hopefully) can render all the characters you need - as described in one of the answers you link to - consider switching to Windows Terminal, installable from the Microsoft Store: it supports a much wider range of characters by default.

  • PowerShell's interpretation of UTF-8 text files without a BOM:

    • In Windows PowerShell - which is what you're using, judging by the screenshot - BOM-less text files are assumed to be ANSI-encoded, i.e. to be encoded with the legacy ANSI code page based on your machine's system locale (language for non-Unicode programs), such as Windows-1252 on US-English systems.

    • PowerShell (Core) 7+, by contrast, now commendably assumes UTF-8, and generally uses BOM-less UTF-8 as the consistent default (including when writing files).

    • Therefore, to decode the file properly, use Get-Content -Encoding Utf8 languages.txt in Windows PowerShell.

      • Note: This in turn may reveal rendering problems due to lack of support for certain Unicode characters in the active font, but in Windows Terminal you'd see the expected output.

  • Python's output character encoding:

    • If you're only printing directly to the console, your script's output will appear correctly, barring any rendering problems due to unsupported characters. The reason is that Python detects this output scenario and uses a Unicode-enabled API to print.

    • More work is needed if you need to further process the output, such as by capturing it in a variable, sending it to another command, or saving it to a file:

      • Python defaults to ANSI(!) encoding on output to stdout, so it must be instructed to output UTF-8 instead, which you can do by setting $env:PYTHONUTF8=1 beforehand or passing -X utf8 on the python / py command line (v3.7+).

      • Complementarily, PowerShell must (temporarily) be instructed to expect UTF-8 output from external programs (instead of the output encoded with the legacy OEM code page), which requires executing [Console]::OutputEncoding = [System.Text.Utf8Encoding]::new()
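
To make the mismatch described above concrete, here is a minimal Python sketch of what happens when UTF-8 bytes are decoded with a legacy code page. The sample word and the choice of Windows-1252 (ANSI) and code page 437 (OEM) are assumptions for illustration, typical of US-English systems; your machine's code pages may differ.

```python
# Illustration only: the word and the code pages are assumed for this sketch.
original = "Français"

# This is what a BOM-less UTF-8 file (or UTF-8 stdout) contains on the wire.
utf8_bytes = original.encode("utf-8")

# Misreading those bytes as ANSI (Windows-1252) garbles non-ASCII characters.
misread_ansi = utf8_bytes.decode("cp1252")

# Misreading them as OEM (code page 437) garbles them differently.
misread_oem = utf8_bytes.decode("cp437")

# Decoding with the encoding that was actually used restores the text.
correct = utf8_bytes.decode("utf-8")

print(misread_ansi)  # FranÃ§ais
print(misread_oem)
print(correct)       # Français
```

Both sides have to agree on the encoding: the fix is never to "repair" the garbled text afterwards, but to make the reader decode with the same encoding the writer used.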

To put it all together in the form of a sample PowerShell script (.ps1):

# PREREQUISITES:
#  * In a *regular console window*: 
#    Choose a font that supports all characters in languages.txt, if possible
#  * Preferably, run from *Windows Terminal*.
#  Additionally, the code assumes:
#    * Windows 10 or higher.
#    * Python 3.7 or higher.

# Download the sample file.
# It contains a list of language names expressed in each language natively,
# therefore containing many non-ASCII-range characters, including CJK ones.
curl.exe -O https://learnpythonthehardway.org/python3/languages.txt

# Print the sample file using a PowerShell command.
# Assuming you've chosen a suitable font or are running from Windows Terminal, 
# all non-ASCII-range characters should render correctly.
Get-Content -Encoding Utf8 languages.txt

pause

# Invoke your Python script file and let it *print directly to the console*.
# Again, this should render the non-ASCII-range characters correctly.
python script.py utf8 strict

pause

# Invoke it again, but with further processing, which requires
#  * requesting that Python use UTF-8
#  * making PowerShell expect UTF-8

# (Temporarily) tell PowerShell to expect UTF-8 stdout output 
# from external programs.
$prevEncoding = [Console]::OutputEncoding
[Console]::OutputEncoding = [System.Text.Utf8Encoding]::new()

# Invoke the Python script, telling Python to output UTF-8 to stdout.
# Select-Object -First 10 limits the output to the first 10 lines.
# Note that this operation alone involves decoding of Python's output by PowerShell.
# Again, this should render the non-ASCII-range characters correctly.
python -X utf8 script.py utf8 strict | Select-Object -First 10

[Console]::OutputEncoding = $prevEncoding
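
As a possible alternative to passing -X utf8 or setting PYTHONUTF8, the script itself could switch its stdout to UTF-8. The sketch below is not part of the approach above: it assumes Python 3.7+, where text streams offer reconfigure(), and the helper name force_utf8 is hypothetical. An in-memory stream stands in for a redirected stdout so the effect can be checked without a console. PowerShell would still need [Console]::OutputEncoding set to UTF-8 to decode the result.

```python
import io
import sys


def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure() (Python 3.7+)."""
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8")
    return stream


# Stand-in for a redirected stdout that would otherwise use an ANSI code page
# (Windows-1252 is assumed here purely for illustration).
fake_stdout = io.TextIOWrapper(io.BytesIO(), encoding="cp1252", errors="strict")

force_utf8(fake_stdout)

# Without the reconfigure() call, this print would raise UnicodeEncodeError,
# because Windows-1252 cannot represent Korean characters.
print("한국어", file=fake_stdout)
fake_stdout.flush()

# The underlying bytes are now UTF-8.
captured = fake_stdout.buffer.getvalue()

# In a real script, the call near the top would simply be:
#     force_utf8(sys.stdout)
```

The trade-off is that this hard-codes the output encoding in the script, whereas -X utf8 and PYTHONUTF8 leave the choice to whoever invokes it.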
mklement0