
I am having trouble displaying Unicode characters in git-bash when using Python's logging module.

  • Without logging - everything works fine.
  • With cmd - everything works fine.
  • With PyCharm - everything works fine.
  • logging with git-bash - not working (displaying "\u2501").

See below...

I am using:

  • Windows 10
  • git version 2.37.2.windows.2
  • Python 3.9.6 (Same issue with Python 3.7).

And even this is not working:

enter image description here

torek
Elad Weiss
  • What's the output of `python -c "import sys; print(sys.stderr.encoding)"`? – Thomas Aug 25 '22 at 08:42
  • @Thomas cp1252 . – Elad Weiss Aug 25 '22 at 08:43
  • Some terminals can't print certain unicode (or any unicode in some cases). This also depends on the font being used (which can be changed). Although often a question mark symbol or similar is printed (not `\u2501`) if the glyph is missing. Try `echo $'\u2501'` in git-bash to verify if the terminal can print this character or not. Also, in my experience, unicode behaviour is fairly inconsistent across different terminals, in respect of width, alignment, and size of glyphs. – dan Aug 25 '22 at 09:03
  • For people who want to test (and do not have OCR): `python -c "import logging; logging.error('L\u2501'); print('P\u2501')"` – Giacomo Catenazzi Aug 25 '22 at 11:53
  • @GiacomoCatenazzi Right! Good comment. Will pay attention from now on. – Elad Weiss Aug 25 '22 at 12:14
  • Side note: Git is not relevant here; git-bash is, but only in that it means you're using some kind of terminal emulation software, which in turn determines which character encoding you get. – torek Aug 25 '22 at 20:22

2 Answers


Python's output encoding is set to cp1252, which can't encode the character U+2501:

>>> '\u2501'.encode('cp1252')
...
UnicodeEncodeError: 'charmap' codec can't encode character '\u2501' in position 0: character maps to <undefined>

Normally, if you try to print this character, you would get a similar error. But logging.error() implicitly calls logging.basicConfig(), which attaches a handler that writes to sys.stderr, and sys.stderr uses the 'backslashreplace' error handler by default (sys.stdout uses 'strict'). That explains why you're seeing backslash-escaped output instead of an error.
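The difference between the two error handlers can be reproduced without a terminal. The following is only a sketch: it wraps an in-memory buffer in a cp1252 text stream, and the helper name `write_with_handler` is made up for illustration.

```python
import io


def write_with_handler(text, errors):
    """Encode text through a cp1252 text stream using the given error handler."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="cp1252", errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


# 'backslashreplace' (the default handler on sys.stderr) escapes the character.
escaped = write_with_handler("L\u2501", "backslashreplace")
print(escaped)  # b'L\\u2501'

# 'strict' (the default handler on sys.stdout) raises instead.
try:
    write_with_handler("P\u2501", "strict")
except UnicodeEncodeError as exc:
    print("strict handler failed:", exc.reason)
```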

There are many ways to get Python to output UTF-8 instead, but don't do that: your terminal expects CP1252, so everything outside the ASCII range will be wrong.

Instead, see this answer for a way to get Git Bash to use UTF-8.

Thomas
  • On Windows (but it will change), the default encoding of Python is not UTF-8 (it is not just for `print` but also to read files). IIRC locales are used only on Unix variants of Python – Giacomo Catenazzi Aug 25 '22 at 09:45
  • @GiacomoCatenazzi Then why does @EladWeiss report that it works fine in `cmd`? – Thomas Aug 25 '22 at 11:36
  • Maybe Python changed. In fact, until a few versions ago Python could not run properly there (there were pythonw, or other tricks). In any case, with a UTF-8 locale and git-bash, the locale is cp1252 (you can test just by using `print` instead of logging, so we have an error with the current locale). – Giacomo Catenazzi Aug 25 '22 at 11:53

As noted in the comments, logging writes to sys.stderr, not sys.stdout.

As for the terminals (at least on my Windows machine):

  • When running Python from cmd the default encoding for both sys.stdout and sys.stderr is utf-8.
  • When running Python from git-bash the default encoding for both sys.stdout and sys.stderr is cp1252. I really have no idea why. I would assume most git-bash users want Linux-like behavior.

It seems that both terminals accept utf-8.
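To compare the two terminals yourself, a small diagnostic along these lines may help. It is only a sketch: `describe_streams` is a hypothetical helper name, and the values it reports depend on your terminal and Python version.

```python
import sys


def describe_streams():
    """Return the encoding and error handler of sys.stdout and sys.stderr."""
    return {
        name: (getattr(stream, "encoding", None), getattr(stream, "errors", None))
        for name, stream in (("stdout", sys.stdout), ("stderr", sys.stderr))
    }


for name, (encoding, errors) in describe_streams().items():
    print(f"{name}: encoding={encoding} errors={errors}")
```

Running this once from cmd and once from git-bash shows whether the two streams differ in encoding, error handler, or both.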

Since my script only needs to be run on these two terminals, the following solves my problem:

import sys

# Make sys.stderr (the stream logging writes to by default) emit UTF-8.
sys.stderr.reconfigure(encoding='utf-8')
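A slightly more defensive variant is sketched below, assuming the stream might have been replaced by an object without `reconfigure()` (available on text streams since Python 3.7). The helper name `force_utf8` is made up for illustration.

```python
import logging
import sys


def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure()."""
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8")
    return stream


force_utf8(sys.stderr)
logging.error("L\u2501")
```

As the other answer points out, this only helps if the terminal actually interprets the bytes as UTF-8.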
Elad Weiss