UTF-8 problems in Python

Question

I have googled this and tried every single solution that I have found, and nothing is working. I am using Python 3. I read a string from a form and try to print/write it. Everything is fine unless it contains non-ASCII characters (I am testing with Greek text).

import cgi
import sys

form = cgi.FieldStorage()
name = form.getvalue("Name")
sys.stderr.write(name)
print(name)

The write outputs Unicode escape sequences (e.g. \u03bc\u03b5\u03c4\u1f70), which is not what I want, and the print crashes with

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

So for some reason print and write treat it differently, which is just weird.
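
The difference most likely comes from the streams' error handlers rather than from the string itself: in CPython, sys.stderr is created with the backslashreplace error handler, while sys.stdout uses strict, so with an ASCII encoding one escapes and the other raises. A minimal sketch that reproduces both behaviours on in-memory streams (the helper name write_ascii and the sample value are made up for illustration):

```python
import io

# Sample value, taken from the escape sequence quoted above.
name = "\u03bc\u03b5\u03c4\u1f70"

def write_ascii(text, errors):
    """Push text through an ASCII text stream using the given error handler."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="ascii", errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()

# Like sys.stderr: characters that cannot be encoded become \uXXXX escapes.
escaped = write_ascii(name, "backslashreplace")

# Like sys.stdout: characters that cannot be encoded raise UnicodeEncodeError.
try:
    write_ascii(name, "strict")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = str(exc)

print(escaped)
print(strict_error)
```

So the two calls are not disagreeing about the data; both streams are limited to ASCII and only differ in how they react to that.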

Here is everything I have tried to get it to print out the text in its original form (as Greek letters):

print(name.encode("UTF-8"))

This prints it in the wrong format (as an escaped bytes literal), similar to what write did above. The following all crash with the same or a similar error:

print(name.encode("UTF-8").decode("UTF-8")) # crashes with same error

ba = bytearray(name,"UTF-8")
n2 = ba.decode("UTF-8")
print(n2) # also crashes

unic = u'' # Nope. Errors still.
unic +=name
print(unic) # also crashes

print(b'{name}') #Prints b'{name}' literally.
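
For reference, most of these attempts hand print() either a bytes object, whose repr is what gets shown, or an equal str that stdout still cannot encode. One workaround that does not depend on the text stream's encoding is to encode explicitly and write the bytes to the underlying binary buffer. A rough sketch, assuming the client expects UTF-8; write_utf8 is a hypothetical helper and the BytesIO stands in for sys.stdout.buffer:

```python
import io

name = "\u03bc\u03b5\u03c4\u1f70"  # sample value; in the script it comes from the form

# print(name.encode("UTF-8")) shows the repr of a bytes object, not the text.
as_printed = repr(name.encode("utf-8"))

# Encoding and decoding again just gives back an equal str.
round_trip = name.encode("utf-8").decode("utf-8")

def write_utf8(binary_stream, text):
    """Write text as UTF-8 bytes, bypassing any text-layer encoding."""
    binary_stream.write(text.encode("utf-8") + b"\n")
    binary_stream.flush()

# Stand-in for sys.stdout.buffer so the sketch runs anywhere.
fake_stdout_buffer = io.BytesIO()
write_utf8(fake_stdout_buffer, name)

print(as_printed)
print(round_trip == name)
```

In the CGI script the call would be write_utf8(sys.stdout.buffer, name), ideally after sys.stdout.flush() so that earlier text output keeps its order.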

If I run similar code locally (instead of on a webserver and getting the string as a response), everything works fine. Somehow the string I am getting back is acting differently and I cannot for the life of me figure out why.

So what very simple thing am I missing here?

In case it is relevant, I ran locale (I am using CentOS 7) and got the following:

LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Update: I printed sys.stdout.encoding from the script and it returns ANSI_X3.4-1968, so that may be the problem. Strangely, when I run the same command from a python3 command-line prompt I get UTF-8. Now I guess I need to figure out how to set the encoding for when it runs from the webserver.
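
ANSI_X3.4-1968 is simply the formal name for ASCII, which is what the C/POSIX locale reports, so the crash in print() follows directly from it. A quick check (a sketch; nothing here is specific to this server):

```python
import codecs
import locale
import sys

# Python's codec registry treats this name as an alias for ASCII.
codec_name = codecs.lookup("ANSI_X3.4-1968").name

# With ASCII and the default strict handler, any character above 127 fails.
try:
    "\u03bc".encode("ANSI_X3.4-1968")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(codec_name, encodable)

# What the current process would use for stdout and for the locale:
print(sys.stdout.encoding, locale.getpreferredencoding(False))
```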

Update 2: I added the following:

import subprocess
# Nothing is captured here, so locale prints directly and A.stdout is None.
A = subprocess.run(["locale"])
print(A.stdout)

And the output is:

LANG= 
LC_CTYPE="POSIX" 
LC_NUMERIC="POSIX" 
LC_TIME="POSIX" 
LC_COLLATE="POSIX" 
LC_MONETARY="POSIX" 
LC_MESSAGES="POSIX" 
LC_PAPER="POSIX" 
LC_NAME="POSIX" 
LC_ADDRESS="POSIX" 
LC_TELEPHONE="POSIX" 
LC_MEASUREMENT="POSIX" 
LC_IDENTIFICATION="POSIX" 
LC_ALL= None

So clearly the locale, and with it the encoding, is set differently when I run from Apache than from the command line. Hmmm...

Update 3: I tried adding the following lines to /etc/sysconfig/httpd and restarted Apache, but nothing changed. (The first two were suggested in one place and the third in another, although none of the sources said where to put them. For CentOS 7, this file seemed the logical choice, but apparently it is not.)

export LANG='en_US.UTF-8'
export LC_ALL='en_US.UTF-8'
export PYTHONIOENCODING=utf-8

Update 4: I tried locale.setlocale(locale.LC_ALL,'en_US.UTF-8') in my script and it didn't help.
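
That is probably expected: sys.stdout is created at interpreter startup, and calling locale.setlocale() afterwards does not re-create it. On Python 3.6 the stream can be re-wrapped by hand, and from 3.7 on there is reconfigure(). A sketch under those assumptions (force_utf8 is a hypothetical helper name, demonstrated on an in-memory stream instead of the real stdout):

```python
import io

def force_utf8(stream):
    """Return a text stream over the same buffer that encodes as UTF-8."""
    if hasattr(stream, "reconfigure"):
        # Python 3.7+: change the encoding of the existing wrapper in place.
        stream.reconfigure(encoding="utf-8")
        return stream
    # Python 3.6: take the binary buffer out of the old wrapper and re-wrap it.
    # The old wrapper is unusable after detach().
    return io.TextIOWrapper(stream.detach(), encoding="utf-8", line_buffering=True)

# Demonstration on an ASCII-only stream like the one Apache hands the script.
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="ascii")
stream = force_utf8(stream)
stream.write("\u03bc\u03b5\u03c4\u1f70")
stream.flush()
print(raw.getvalue())
```

In the script itself this would be sys.stdout = force_utf8(sys.stdout), placed before the first print().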

Also, strangely, there is the following in my old error_log files for httpd, but not in the current one (so the last time this printed was several days ago):

Fatal Python error: Py_Initialize: Unable to get the locale encoding

This seems to track with what I am seeing--the environment variables are not being used/seen when my scripts run from Apache.

Update 5: I found a hack that works. Instead of running my Python script directly from Apache, I run runMakeArt, which is as follows:

#!/usr/bin/sh
export PYTHONIOENCODING=utf-8 ; /usr/bin/python3 makeArt.py

and so far this seems to be working. In some sense maybe this is better than properly configuring Apache since if I move servers (hopefully not!), this should still work without worrying about whether or not Apache is configured correctly.
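
The effect of the wrapper can be checked from Python without going through Apache. This sketch starts a child interpreter with a deliberately bare C locale plus PYTHONIOENCODING, roughly what the wrapper script sets up, and asks it which encoding stdout ended up with (the variable names are illustrative):

```python
import os
import subprocess
import sys

# Similar to the POSIX locale seen under Apache, but with PYTHONIOENCODING set.
env = dict(os.environ, LC_ALL="C", PYTHONIOENCODING="utf-8")

result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
    env=env,
    stdout=subprocess.PIPE,  # works on Python 3.6 as well
    check=True,
)
child_encoding = result.stdout.decode("ascii").strip()
print(child_encoding)
```

Because PYTHONIOENCODING is read at interpreter startup, it has to be in the environment before Python launches, which is why setting it from inside the running script has no effect.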

Where did you execute `locale`? Locally or on the webserver? — Sören, Apr 29 '22 at 19:25
And are your scripts run via Apache or some other httpd? Under a separate user? Does that user have the same locale settings? Try adding `import subprocess; subprocess.run(["locale"])` to one of your scripts and edit the output into your question — Sören, Apr 29 '22 at 19:34
@Sören Well that sucks. I have Python 3.6 currently, and that solution requires Python 3.7. I guess maybe on Monday I will try to update to the latest Python so I can mess everything else up and then eventually fix this problem. The joys of technology. — ferzle, Apr 29 '22 at 20:09
Probably you don't need that. The 3.7 feature is that you can change it programmatically from within Python itself. But all you really need is to make sure you start up Python in a way which enables it to write Unicode (or generally anything more than just 7-bit ASCII) to standard output. — tripleee, Apr 29 '22 at 20:13
I'm currently trying to figure out how to set `PYTHONIOENCODING=utf-8` in such a way that it is set for the Apache user and hoping that fixes it. — ferzle, Apr 29 '22 at 20:15
Possible duplicate of https://stackoverflow.com/questions/47844627/django-unicodedecodeerror-only-on-apache-nginx — tripleee, Apr 29 '22 at 20:19
A suggestion, it usually helps to provide `print(repr(name))` because then, at least if `name` is a built-in object, we can unambiguously determine what it is. — juanpa.arrivillaga, Apr 29 '22 at 20:23
@tripleee In Centos it appears to be in a different place. I can't find envvars files anywhere on my system so far. — ferzle, Apr 29 '22 at 20:26
You don't necessarily need to find the OS defaults for interactive use; just set the variable manually to what you want it to be. https://httpd.apache.org/docs/2.2/mod/mod_env.html#setenv — tripleee, Apr 30 '22 at 05:01
On a fresh Docker `centos` image, I find `/etc/locale.conf` but it's unremarkable, and simply contains `LANG="en_US.UTF-8"` — tripleee, Apr 30 '22 at 05:03
