0

I'm trying to print a Unicode character from Python 3 to the web. In Python I can run:

print("Content-Type: text/html; charset=utf-8\n")
print("\u00EA")

When run from the command line it correctly spits out:

Content-Type: text/html; charset=utf-8

ê

But when run from the web as a CGI script under Apache, it throws an error:

UnicodeEncodeError: 'ascii' codec can't encode character '\xea' in position 0: ordinal not in range(128)

Any suggestions on how to get Python 3 to print UTF-8 to the web? Thanks!

Edit: The output of locale in both my account and www-data (Apache's account) is:

LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Neil Fraser
  • Just don't use `printf`? Configure the OS locale to use UTF8? – Panagiotis Kanavos Sep 23 '20 at 06:58
  • BTW this page just like every other web page is UTF8. No special handling is required, that's why you can read those Vietnamese characters without having to encode them or contact SO's team to ask for special treatment. – Panagiotis Kanavos Sep 23 '20 at 07:00
  • What are the locale settings for the user running the apache process? – snakecharmerb Sep 23 '20 at 07:02
  • The problem is *printf* and the console. You're writing to the console, not a file, which means Python has to use the console's encoding. If your environment is configured to use Latin1 or worse, US-ASCII, the text will be mangled. On Linux, this is controlled by `LC_ALL`. Windows is natively Unicode *BUT* the *console GUI* is locale specific and needs a setting to display UTF8. That shouldn't affect process-to-process communication though – Panagiotis Kanavos Sep 23 '20 at 07:03
  • LC_ALL is the empty string in my account. Not sure how to look it up in the www-data account since its shell is `/usr/sbin/nologin`. OS is Ubuntu. The error is thrown on the `print` statement, `read` executes fine. – Neil Fraser Sep 23 '20 at 07:13
  • `sudo su www-data` then `locale`. Please [edit] the question to show the _complete_ output of `locale`, not just `LC_ALL`. – snakecharmerb Sep 23 '20 at 07:15
  • What I suspect is happening is that the apache user has an ASCII locale, and you may need to change this to handle UTF-8. There might be an apache setting to handle this, but given the nature of CGI I suspect that's unlikely. – snakecharmerb Sep 23 '20 at 07:17
  • Added `locale` output for the www-data account (had to give it a shell so I could `su` to it). – Neil Fraser Sep 23 '20 at 07:22
  • This answer looks like what you want: https://stackoverflow.com/a/19574801/5320906. Option 4 (pass through Apache's `$LANG`) looks like the best approach. Fiddling with `sys.defaultencoding` is suggested in other answers, but that's pretty hacky imo. – snakecharmerb Sep 23 '20 at 07:58
  • Yes, this is the issue, thanks! `sys.stdout.encoding` returns `UTF-8` on the command line, but `ANSI_X3.4-1968` when run from Apache. Working on how to change this... – Neil Fraser Sep 23 '20 at 14:35
  • Adding the following to the Apache config solves the issue: `SetEnv LANG en_US.UTF-8` – Neil Fraser Sep 23 '20 at 14:47

3 Answers

1

You have to encode the data as UTF-8 explicitly. Otherwise Python takes the encoding of stdout from the environment, and under Apache that turned out to be ASCII, which cannot represent the character. So, do this:

import sys

# text is the str you want to send; encode it and write the bytes directly
sys.stdout.buffer.write(text.encode('utf-8'))

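This will fix your error. Note that I am using sys.stdout.buffer.write rather than print, because buffer.write accepts raw bytes, and UTF-8 encoded text is a bytes object, not a str. If you also use print in the same script, call sys.stdout.flush() before writing to the buffer so the output stays in order.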

In addition to that, you should tell the client (browser) that the data is served as utf-8 (otherwise the browser will also have to guess, which may succeed, but it is better to be explicit), e.g.

print("Content-Type: text/html; charset=utf-8\n")
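Putting the two pieces together, here is a minimal sketch of sending both the header and the body as bytes, so nothing depends on the encoding stdout happens to have. The function names `build_response` and `send` are made up for this illustration; only `sys.stdout.buffer` and `str.encode` come from the answer above.

```python
import sys


def build_response(body: str) -> bytes:
    """Return the header, a blank line, and the body as UTF-8 bytes."""
    header = "Content-Type: text/html; charset=utf-8\n\n"
    return (header + body).encode("utf-8")


def send(body: str) -> None:
    """Write the encoded response straight to the binary layer of stdout."""
    # Flush text queued by earlier print() calls so it cannot end up
    # after the bytes written directly to the underlying buffer.
    sys.stdout.flush()
    sys.stdout.buffer.write(build_response(body))
    sys.stdout.buffer.flush()
```

In the CGI script you would then call `send("\u00ea\n")` instead of the two print calls. Because the bytes never pass through the text layer, this should behave the same on the command line and under Apache, whatever `sys.stdout.encoding` reports.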
zvone
  • That's one of the things I tried, but results in this on both the command line and the web browser: b'L\xc3\xaa Qu\xc3\xbd \xc4\x90\xc3\xb4n\n' – Neil Fraser Sep 23 '20 at 06:58
  • Python 3 strings are Unicode already. This is more of an OS/console problem. Perhaps the OS's locale is configured to use ASCII? If on Linux, how is LC_ALL set? – Panagiotis Kanavos Sep 23 '20 at 06:58
  • @NeilFraser Yes, I wrote the code too fast and did not check. It can't work with `print`. I fixed the answer, try with that. – zvone Sep 23 '20 at 10:40
  • @PanagiotisKanavos Python 3 strings are unicode, they are not UTF-8. They need to be converted to UTF-8 byte arrays. `print` will do that automatically by inspecting the encoding of `stdout`, which is fine for just printing to the console. Generating HTTP response should be more explicit about encodings. – zvone Sep 23 '20 at 10:43
1

Thanks to the feedback from users here, I was able to piece together a solution:

  1. The Content-Type line must include charset=utf-8.
  2. Apache's configuration file must include SetEnv LANG en_US.UTF-8.

A great debugging tool was to print the value of sys.stdout.encoding. It should return "UTF-8", not "ANSI_X3.4-1968" (an alias for ASCII).
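If changing the Apache config is not an option, an alternative that was not tried in this thread is to switch the stream's encoding from inside the script. The sketch below assumes Python 3.7 or newer, where text streams have a `reconfigure()` method. The helper name `force_utf8` is made up, and the ASCII stream is only a stand-in for the stdout that Apache hands to the script.

```python
import io


def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure() (3.7+)."""
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8")
    return stream


# Stand-in for stdout under Apache: a text layer that encodes as ASCII.
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="ascii", newline="\n")
encoding_before = stream.encoding

# Without this call, the print below raises UnicodeEncodeError.
force_utf8(stream)
print("\u00ea", file=stream)
stream.flush()
encoded = raw.getvalue()
```

In the real script the equivalent would be `force_utf8(sys.stdout)` before the first print. The `SetEnv LANG en_US.UTF-8` fix above is the one confirmed to work here; treat this as an untested fallback for that setup.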

Neil Fraser
0

When you read a file, use a context manager.

Behind the scenes, the file is opened and closed for you, so you don't have to remember to do it.

with open(filename, encoding='utf-8') as f:
    text = f.read()
print(text)
woblob