Proper use of unicode characters in python3 - Force utf-8 encoding

Question

I'm going crazy here. The internet and this SO question tell me that in python 3.x, the default encoding is UTF-8. In addition to that, my system's default encoding is UTF-8. In addition to that, I have # -*- coding: utf-8 -*- at the top of my python 3.5 file.

Still, python is using ascii:

# -*- coding: utf-8 -*-
mystring = "Ⓐ"
print(mystring)

Greets me with:

SyntaxError: 'ascii' codec can't decode byte 0xe2 in position 7: ordinal not in range(128)

I've also tried this: print(mystring.encode("utf-8")) and .decode("utf-8") - Same thing.

What am I missing here? How do I force python to stop using ascii encoding?

Edit: I know that it seems weird to complain about position 7 with a one character string, but this is my actual MCVE and the exact output I'm getting. The above is using python shell, the below is in a script. Both use python 3.5.2.

Edit: Since I figured it might be relevant: The string I'm getting comes from an external application and is not hardcoded, so I need a way to get that utf-8 string and save it into a file. The above is just a minimalized and generalized example. Here is my real-life code:

# the variables being a string that might contain unicode characters
mystring = "username: " + fromuser + " | printname: " + fromname
with open("myfile.txt", "a") as myfile:
  myfile.write(mystring + "\n")

Your error message doesn't make sense. On Python 3, `mystring` is a length-1 Unicode string. There is no way for `.decode` to be called (Unicode strings are encoded, not decoded), and `position 7` is impossible for a length-1 string. Provide a [mcve]. — Mark Tolonen, Aug 14 '18 at 02:55
If you want to write a file using UTF-8 encoding, use `with open('myfile.txt','a',encoding='utf8') as myfile:`. — Mark Tolonen, Aug 14 '18 at 02:58
How are you calling your Python script? Is there a shell pipeline into or out of Python? — tripleee, Aug 14 '18 at 03:38
@MarkTolonen I thought the same, but this is my MCVE and I did exactly what I provided. It says "position 7" on a one character string, yes. As for writing to a file with utf8, shouldn't it be the default? Everything on my system uses utf8 and so should python. — confetti, Aug 14 '18 at 09:46
@tripleee My first example is in python shell, the latter code is a script. Both use `python 3.5.2` — confetti, Aug 14 '18 at 09:47

sehafoc · Accepted Answer · 2018-08-14T20:10:52.290

In Python 3 all strings are Unicode, so the problem you're having is likely due to your locale settings not being correct. The Python 3 interpreter tries to use the locale environment variables, and if it cannot find them it falls back to emulating basic ASCII.

From locale.py:

except ImportError:

    # Locale emulation

    CHAR_MAX = 127
    LC_ALL = 6
    LC_COLLATE = 3
    LC_CTYPE = 0
    LC_MESSAGES = 5
    LC_MONETARY = 4
    LC_NUMERIC = 1
    LC_TIME = 2
    Error = ValueError
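
To check what the interpreter actually picked up from the environment, a small diagnostic along these lines may help. This is only a sketch: `describe_encodings` is a hypothetical helper name, not something from the original answer, and the values it reports depend entirely on your platform and locale setup.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python derived from the current environment."""
    return {
        # Locale-derived encoding; open() uses it when encoding= is omitted.
        "preferred": locale.getpreferredencoding(False),
        # Encoding print() uses for standard output (may be None if
        # stdout has been replaced by an object without an encoding).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used for file names and command-line arguments.
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print("{}: {}".format(name, value))
```

If the reported values are an ASCII variant rather than UTF-8, the locale is the likely culprit.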

Double-check the locale of the shell from which you are executing. Here are a few workarounds you can try to see if they get you working before you go through the task of setting up your environment correctly.

1) Validate UTF-8 locale or language files are installed (see link above)

2) Try adding this to the top of your script

#!/usr/bin/env LC_ALL=en_US.UTF-8 /usr/local/bin/python3
print('カタカナ')

or

#!/usr/bin/env LANG=en_US.UTF-8 /usr/local/bin/python3
print('カタカナ')

Or export shell variables before executing the Python interpreter

export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
python3
>>> print('カタカナ')

Sorry I cannot be more specific, as these settings are platform- and OS-specific. You can attempt to force the locale from within Python using the `locale` module, but I don't recommend that, and it won't help if the locale files are not installed.
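
For the file-writing part of the question, a locale-independent option is to pass the encoding explicitly, along the lines of the `encoding='utf8'` suggestion in the comments under the question. A minimal sketch follows; `append_line` is a hypothetical helper, and the `fromuser` and `fromname` values are made-up placeholders for the data coming from the external application.

```python
import os
import tempfile


def append_line(path, text):
    """Append text plus a newline to path, always encoded as UTF-8."""
    # encoding="utf-8" overrides the locale-derived default of open().
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(text + "\n")


if __name__ == "__main__":
    # Placeholder values standing in for the externally supplied strings.
    fromuser = "confetti"
    fromname = "\u24b6 カタカナ"
    path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
    append_line(path, "username: " + fromuser + " | printname: " + fromname)
    # Read the raw bytes back; printing bytes is safe even on an ASCII stdout.
    with open(path, "rb") as handle:
        print(handle.read())
```

This only affects the file. For `print()`, the `PYTHONIOENCODING` environment variable plays a comparable role, although fixing the locale is still the cleaner solution.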

Hope that helps.

First `print` command gives `UnicodeEncodeError: 'ascii' codec can't encode character '\xe2' in position 0: ordinal not in range(128)` - In a python 3.5.2 shell. Second command works, the last 3 don't. — confetti, Aug 14 '18 at 09:50
I think I found the issue thanks to you, my `locale` output reports nothing but `POSIX` - I'm going to add the proper locales and try again. If that was the cause I'll accept your answer. — confetti, Aug 14 '18 at 09:50
Got it. The issue was indeed my `locale` settings. I've had UTF-8 selected as default system-wide locale, however it wasn't installed at all. I'll accept your answer, do you mind adding in some more information about changing the locale or something that will be of help for others with the same problem in the future? — confetti, Aug 14 '18 at 10:06

J. Blackadar · Answer 2 · 2018-08-13T22:59:31.630

What's new in Python 3.0 says:

All text is Unicode; however encoded Unicode is represented as binary data

If you want to try decoding UTF-8 bytes into a string, here's an example:

b'\x41'.decode("utf-8", "strict")

If you'd like to use Unicode in a string literal, use a Unicode escape sequence with the character's code point. For your example:

print("\u24B6")
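
To make the relationship between the escape, the literal character, and its UTF-8 bytes concrete, here is a short sketch. The variable names are illustrative only. Note that the first encoded byte is `0xe2`, which appears to be the byte named in the question's error message.

```python
import unicodedata

escaped = "\u24B6"  # escape sequence from the example above
literal = "Ⓐ"       # literal character from the question

# Both spellings produce the same one-character str object.
assert escaped == literal
assert len(escaped) == 1

# Look up the official name of the code point in the Unicode database.
name = unicodedata.name(escaped)

# Encoding turns the str into bytes; decoding reverses it.
encoded = escaped.encode("utf-8")
roundtrip = encoded.decode("utf-8")

print(name)                  # CIRCLED LATIN CAPITAL LETTER A
print(encoded)               # b'\xe2\x92\xb6'
print(roundtrip == escaped)  # True
```

Everything printed here is plain ASCII, so the script runs even when standard output is still limited to ASCII.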

edited Aug 13 '18 at 22:59

answered Aug 13 '18 at 22:51


That gives me `UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte`, and is also not quite what I need. I have a string that contains unicode characters, and I simply want to print them (and write them to a file later). – confetti Aug 13 '18 at 22:56
My apologies. See the edit with the example, try using the unicode escape sequence and its numerical code. – J. Blackadar Aug 13 '18 at 23:04
That works now as example, but I need a better solution than that. I'm getting the string containing those unicode characters from an external application, they are not hardcoded. My ultimate goal is to save that string into a file, with utf-8 encoding. – confetti Aug 13 '18 at 23:57
Can you specify an encoding from your source? For example, using io: `import io; f = io.open("test", mode="r", encoding="utf-8")` – J. Blackadar Aug 14 '18 at 15:00
It works now, the issue was that my system had utf-8 as locale set but it wasn't installed properly. Re-generating my locales fixed the issues. – confetti Aug 14 '18 at 15:02
