Special characters like ç and ã aren't decoded when the text is obtained from a file

Question

I'm learning Python and tried to make a Hangman game. For those who aren't familiar with this game, the player must discover a secret word by guessing one letter at a time.

In my code, I get a collection of secret words which is imported from a txt file using the following code:

words_bank = open('palavras.txt', 'r')
words = []
for line in words_bank:
  words.append(line.strip().lower())
words_bank.close()
print(words)

The output of print(words) is ['ma\xc3\xa7\xc3\xa3', 'a\xc3\xa7a\xc3\xad', 'tucum\xc3\xa3']. However, if I try print('maçã, açaí, tucumã') to check the special characters, everything is printed correctly. It looks like the issue is in the encoding (or decoding... I'm still reading lots of articles about it to really understand) of special characters read from files.

The content of line 1 of my code is # coding: utf-8 because after some research I found out that I have to specify the Unicode format that is required for the text to be encoded/decoded. Before adding it, I was receiving the following message when running the code:

File "path/forca.py", line 12
SyntaxError: Non-ASCII character '\xc3' in file path/forca.py on line 12, but no encoding declared

Line 12 content: print('maçã, açaí, tucumã')

Things that I've already tried:

add encode='utf-8' as parameter in open('palavras.txt', 'r')
add decode='utf-8' as parameter in open('palavras.txt', 'r')
same as above but with latin1
substitute line 1 content for #coding: latin1

My OS is Ubuntu 20.04 LTS and my IDE is VS Code. Nothing works! I don't know what to search for or what to do anymore.

SOLUTION HERE

Thanks to the help given by the friends above, I was able to find out that the real problem was in the combo VS Code extension (Code Runner) + python alias version from Ubuntu 20.04 LTS.

Code Runner is set to run code in the terminal in my setup, so apparently, when it called python, the alias pointed to Python 2.7.x. To overcome this, I used this thread to set Python 3 as the default.

It's done! Whenever python is called, both in the terminal and in VS Code with Code Runner, all special characters work just fine.
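A quick way to confirm which interpreter actually runs a script is to ask it from inside the script. This is only a minimal sketch; the helper name `describe_interpreter` is made up for illustration and is not part of the original code.

```python
import sys


def describe_interpreter():
    """Return a short description of the interpreter running this script."""
    # sys.version_info holds the version numbers; sys.executable is the
    # path of the interpreter binary that was launched.
    major, minor = sys.version_info[0], sys.version_info[1]
    return 'Python %d.%d at %s' % (major, minor, sys.executable)


print(describe_interpreter())
```

If the line printed starts with "Python 2.", the command used to launch the script (here, the one Code Runner calls) still points to Python 2.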

Thanks, everybody, for your time and your help =)

The `#coding:` comment only applies to your Python code, not to other random files with data in them. Python 3 should read UTF-8 by default on Ubuntu, so I concur that you somehow seem to be using Python 2 in spite of your claims to the contrary. — tripleee, Dec 16 '20 at 13:52
Can you type (in a shell console): `file palavras.txt`? It will tell you which encoding is actually used for your text file. The `#coding` line you are adding is only a hint for some text editors (and for the Python interpreter, about the source file itself); it does not change the actual encoding of the data file, and won't have any effect on how your current Python code reads it. — Demi-Lune, Dec 16 '20 at 13:52
You're using python2 (just tested in both interpreters). Run your example with `python3` instead. — Niloct, Dec 16 '20 at 13:55
It is of course also possible that your file contains [mojibake](https://en.wikipedia.org/wiki/Mojibake). We can't really tell until you show us the actual bytes in the file. (Though the symptoms with Python 3 would still look different.) See also https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors — tripleee, Dec 16 '20 at 13:56

Alastair McCormack · Accepted Answer · 2020-12-16T14:30:56.427

This only happens when using Python 2.x.

The error is probably because you're printing a list rather than printing the items in the list.

When calling print(words) (words is a list), Python invokes a special function called repr on the list object. The list then builds a summary representation of itself by calling repr on each item in the list, and joins the results into a neat string visualisation.

In Python 2, repr(string) actually returns an ASCII representation (with escapes) rather than a version suitable for your terminal.

Instead, try:

for x in words:
    print(x)
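The difference between the escaped form and the readable form can be reproduced in Python 3 with a bytes object. This is a rough sketch, not the asker's code: the byte string is copied from the output shown in the question, and the helper name `decode_word` is hypothetical.

```python
def decode_word(raw_bytes):
    """Decode UTF-8 bytes (as stored in the file) into text."""
    return raw_bytes.decode('utf-8')


# Bytes copied from the question's output for the first word.
raw = b'ma\xc3\xa7\xc3\xa3'
word = decode_word(raw)

# repr() of the raw bytes is ASCII-only, with \x escapes for each
# non-ASCII byte, much like what Python 2 showed inside the list.
escaped = repr(raw)

if __name__ == '__main__':
    print(escaped)  # the escaped form
    print(word)     # the readable form
```

The six bytes decode to four characters, which is why each accented letter shows up as two `\x..` escapes in the Python 2 output.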

Note: the option for open is encoding (not encode or decode). For example:

open('myfile.txt', encoding='utf-8')

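You should always, always pass the encoding option to open. When it is omitted, Python 3 falls back to the locale's preferred encoding: on Linux and Mac that is UTF-8 for most people, while on Windows it is typically an 8-bit code page. UTF-8 only becomes the default on every platform from Python 3.15 (PEP 686). In Python 2, the built-in open does not accept an encoding option at all; io.open does.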
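As a minimal sketch of reading the word list with an explicit encoding: the version below uses `io.open`, which accepts `encoding` on both Python 2 and Python 3 (on Python 3 it is the same function as the built-in `open`). The function name `read_words` and the temporary demo file are assumptions made for illustration; they stand in for the asker's loop and for palavras.txt.

```python
import io
import os
import tempfile


def read_words(path, encoding='utf-8'):
    """Return the stripped, lower-cased lines of a text file."""
    # Passing encoding explicitly means the result does not depend on
    # the platform or locale defaults.
    with io.open(path, 'r', encoding=encoding) as words_bank:
        return [line.strip().lower() for line in words_bank]


# Create a throwaway UTF-8 file so the example is self-contained.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as demo_file:
    demo_file.write(u'Ma\u00e7\u00e3\nA\u00e7a\u00ed\nTucum\u00e3\n'.encode('utf-8'))

words = read_words(demo_path)
os.remove(demo_path)

# Print the items one by one rather than the list itself.
for word in words:
    print(word)
```

Using a `with` block also closes the file automatically, so no explicit close() call is needed.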

See Python 2.x vs 3.x behaviour:

Py2

>>> print ['maçã', 'açaí', 'tucumã']
['ma\xc3\xa7\xc3\xa3', 'a\xc3\xa7a\xc3\xad', 'tucum\xc3\xa3']
>>> repr('maçã')
"'ma\\xc3\\xa7\\xc3\\xa3'"
>>> print 'maçã'
maçã

Py3

>>> print(['maçã', 'açaí', 'tucumã'])
['maçã', 'açaí', 'tucumã']
>>> repr('maçã')
"'maçã'"

The output from `print()` also depends on your platform and how Python is configured, especially on Python 2. — tripleee, Dec 16 '20 at 13:58
@tripleee Indeed, it's something I've spent a lot of time working on; see my previous answers. In this instance, I kept it brief. What I hadn't appreciated is that Py3 now properly handles repr on non-ASCII text. But note my Py2 vs. Py3 output on my Mac terminal, configured for `en_GB.UTF-8`. — Alastair McCormack, Dec 16 '20 at 14:10
