
Before getting into the problem: I have read many Stack Overflow questions and Python bug reports about this issue, but I have not been able to find the root cause.

I am getting a UnicodeEncodeError on a CentOS machine. Python is not built on that machine; the virtual environment with the required Python version (3.6.7) is built elsewhere and copied over. To start the server, we activate the virtual environment and then launch it.

The issue is observed in two scenarios:

  1. Logging an input request parameter that contains Unicode characters.
  2. We pipe print statements to a log file, and the error shows up there when the code tries to print this Unicode string.

The error looks as follows:

print("\u6211\u7684\u7535\u8111\u603b\u662f\u51fa\u73b0Windows\u9700\u8981\u6fc0\u6d3b")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 56-63: ordinal not in range(128)
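
The same exception can be reproduced without a terminal: it is raised whenever a text stream whose encoding is ASCII is asked to write these characters. A minimal sketch, where the in-memory `io.BytesIO` buffer stands in for the redirected log file and `write_with_encoding` is a hypothetical helper (both are assumptions for illustration, not the server's actual setup):

```python
import io

text = "\u6211\u7684\u7535\u8111\u603b\u662f\u51fa\u73b0Windows\u9700\u8981\u6fc0\u6d3b"


def write_with_encoding(s, encoding):
    """Write s through a text stream using the given encoding; return the raw bytes."""
    raw = io.BytesIO()
    # newline="" disables newline translation so the bytes are platform-independent
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="")
    stream.write(s + "\n")
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # release the buffer without closing it
    return data


ascii_error = None
try:
    write_with_encoding(text, "ascii")
except UnicodeEncodeError as exc:
    # Same error class and codec name as in the traceback above
    ascii_error = exc

# With a UTF-8 stream the identical string is written without error
utf8_bytes = write_with_encoding(text, "utf-8")
```

This suggests the string itself is fine and the failure depends only on which encoding the output stream was created with.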

I verified the following in the Python terminal:

  • sys.getdefaultencoding() - utf-8
  • sys.getfilesystemencoding() - utf-8
  • sys.stdout.encoding
  • LANG is set to en_us.utf-8
  • LC_ALL is not set
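
The values above came from an interactive session, where stdout is a terminal. A process whose output is redirected can see different values, so it may help to collect them from inside the running server. A hedged sketch; the helper name `encoding_report` is hypothetical, and the environment variables listed are simply the ones mentioned in this thread:

```python
import locale
import os
import sys


def encoding_report():
    """Collect the settings that influence how print() encodes text in this process."""
    stdout = sys.stdout
    isatty = getattr(stdout, "isatty", None)
    return {
        "default_encoding": sys.getdefaultencoding(),
        "filesystem_encoding": sys.getfilesystemencoding(),
        # May be None if stdout was replaced by an object without this attribute
        "stdout_encoding": getattr(stdout, "encoding", None),
        "stdout_isatty": bool(isatty()) if callable(isatty) else False,
        "preferred_encoding": locale.getpreferredencoding(False),
        "env": {
            name: os.environ.get(name)
            for name in ("LANG", "LC_ALL", "LC_CTYPE", "PYTHONIOENCODING")
        },
    }


report = encoding_report()
for key in sorted(report):
    # ascii() keeps the output printable even on an ASCII-only stream
    print("{}: {}".format(key, ascii(report[key])))
```

Running this once at server startup, with stdout redirected exactly as in production, would show whether `stdout_encoding` differs from what the interactive session reports.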

I went through some solutions that suggest modifying LC_ALL or adding PYTHONIOENCODING to the environment variables, but I am reluctant to change those without knowing the side effects, since this is a production environment.
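
One way to limit side effects is to set such a variable only for a single process rather than machine-wide. A minimal sketch, assuming the process is launched from Python via `subprocess`; the child command is a stand-in that just prints a Unicode string, `run_child` is a hypothetical helper, and `LC_ALL=C` is forced here only to simulate an ASCII locale:

```python
import os
import subprocess
import sys

# Child program written with escapes so the command line itself stays ASCII
CHILD_CODE = r"print('\u65e5\u672c\u8a9e')"


def run_child(extra_env):
    """Run the child with a modified copy of the environment; the parent's is untouched."""
    env = dict(os.environ)
    env.update(extra_env)
    return subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )


# PYTHONIOENCODING affects only stdin/stdout/stderr of that one interpreter
result = run_child({"LC_ALL": "C", "PYTHONIOENCODING": "utf-8"})
decoded = result.stdout.decode("utf-8").strip()
```

Because the variable lives only in the child's environment, other programs on the machine keep their existing locale behaviour.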

Edit - I tried printing the same characters that break the code in an interactive Python terminal, and they print without any issue. I tried it this way:

import sys
print("日本語")
sys.stdout.write("日本語\n")

But when run through the code, it raises UnicodeEncodeError.

How can I resolve this?

Thanks

Satyaaditya
  • Can you try `sys.stdout = codecs.getwriter('utf8')(sys.stdout)` before the `print` command is called? (And eventually for `sys.stderr`, too.) – pschill Feb 05 '20 at 07:53
  • u want me to do this before starting any printing or logging in code? because, when I am trying to print the same Unicode characters through python terminal, there are no issues. – Satyaaditya Feb 05 '20 at 08:09
  • Call above command in your Python script before you call `print(...)`. It should change `sys.stdout` to a `utf8` compatible stream and might fix your error. – pschill Feb 05 '20 at 08:12
  • Similar https://stackoverflow.com/a/57224678/5320906 – snakecharmerb Feb 05 '20 at 09:06
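
On Python 3, `sys.stdout` is already a text stream, so a wrapper like the one suggested in the comments would need to go around the underlying binary stream (`sys.stdout.buffer`) instead. A hedged sketch of that variant; `utf8_writer` is a hypothetical helper name, and `io.BytesIO` stands in for the real stdout so the example is self-contained:

```python
import io


def utf8_writer(binary_stream):
    """Return a text stream that always encodes to UTF-8, regardless of the locale."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", line_buffering=True)


# Demonstration against an in-memory buffer instead of the real stdout
fake_stdout = io.BytesIO()
out = utf8_writer(fake_stdout)
print("\u65e5\u672c\u8a9e", file=out)
out.flush()
written = fake_stdout.getvalue()

# In a real script this would run once, before the first print():
#   import sys
#   sys.stdout = utf8_writer(sys.stdout.buffer)
#   sys.stderr = utf8_writer(sys.stderr.buffer)
```

This changes only the current process and needs no environment changes, at the cost of touching the application code.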

2 Answers


Most ASCII terminals cannot render Unicode characters (you could try changing the font; maybe that would work), so even if you get past your encoding error, your print will probably look like �������Windows�������

If you run it in IDLE, it would work.

I would strongly recommend just print(repr(string_that_might_have_unicode)), as that is meant to guarantee an ASCII-printable representation (on Python 3 it is ascii() that gives this guarantee; repr() only escapes non-ASCII characters on Python 2). Nothing is worse than crashing your application because you were trying to print some debug information. The escaped form will look something like '\u6211\u7684\u7535\u8111\u603b\u662f\u51fa\u73b0Windows\u9700\u8981\u6fc0\u6d3b'

You could also try to encode it manually before printing it:

print(my_unicode_string.encode("utf8"))

That might work in some terminals, but really, just print the repr unless you are showing the output to the user. (Since you mention a server, I imagine this is not a terminal client application but debug information that is printed and redirected to a log file?)

If you really need to print the exact Unicode to the terminal instead of the repr, then I think you need to do the manual encode step to send UTF-8 to the actual terminal. But it is much easier to always print the repr when logging (this has the benefit of showing you invisible and whitespace characters, though it is not great if the output is part of a client application).
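
On Python 3, which the question uses, `repr()` and `ascii()` treat non-ASCII characters differently, and that matters for the advice above. A short sketch comparing the two on a shortened version of the string from the question:

```python
text = "\u6211\u7684Windows"

as_repr = repr(text)    # on Python 3, printable non-ASCII characters are kept as-is
as_ascii = ascii(text)  # every non-ASCII character becomes a \uXXXX escape

# The ascii() form can always be written to an ASCII-only stream
ascii_bytes = as_ascii.encode("ascii")

# The repr() form still contains the original characters, so this fails
try:
    as_repr.encode("ascii")
    repr_is_ascii_safe = True
except UnicodeEncodeError:
    repr_is_ascii_safe = False
```

So for debug output that must never crash on an ASCII stream, `print(ascii(value))` is the safer call on Python 3.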

Joran Beasley
  • thanks for quick reply @Joran, I am doing two things here, firstly logging the query string (unicodes are part of this) in log file and we print some things on console(like the unicode string when a request comes) which are redirected to a log file. i will edit my question for better understanding – Satyaaditya Feb 05 '20 at 07:57
  • i made changes to my question for better understanding – Satyaaditya Feb 05 '20 at 08:04
  • "most ASCII terminals cannot render Unicode characters" - it is working when I try that in the terminal, but through code, it is throwing encoding error – Satyaaditya Feb 07 '20 at 06:04
  • idle can indeed print them (assuming thats what you meant by open the terminal ... ) some linux terminals will also print them ... however try opening "cmd.exe" and running python and that sys.stdout command – Joran Beasley Feb 07 '20 at 07:10

I finally got rid of this issue in the following way.

I observed the issue mentioned in the question under two different circumstances.

The first scenario - With all the settings posted in the question (all language-related encodings set to UTF-8), it started working after our production server was restarted, without any other changes. I still don't know why it failed before and worked after restarting the machine.

The second scenario - All LC variables are set to POSIX in our client environment. Many of the solutions I went through suggested changing LANG or LC_ALL to UTF-8, but changing all of the locale settings may cause problems in other locale-dependent behaviour, such as date and time conversion.

Fix - I changed only LC_CTYPE to a UTF-8 locale; in our case it is en_US.UTF-8:

export LC_CTYPE="en_US.UTF-8"

and it worked.

Satyaaditya