I'm using Python 2 and I'm trying to put many words from a French dictionary into a set object, but I keep running into an encoding problem with the words that have accents.

This is my main code (this part reads a text file):

#!/usr/bin/env python
# -*- encoding: utf-8 -*- 
from sets import Set
with open('.../test_unicode.txt', 'r') as word:
    lines = word.readlines()
    print(lines)

And this is the result of my print:

['\xc3\xa9l\xc3\xa9phants\n', 'bonjour\n', '\xc3\xa9l\xc3\xa8ves\n']

This is my text file for this example:

éléphants
bonjour
élèves

After, this is the continuity of my main code that put the words in a python set:

dict_word = Set()
for line in lines:
    print(line)
    dict_word.add(line[:-1].upper()) #Get rid of the '\n'

print(dict_word)

This is the result of my print:

Set(['\xc3\xa9L\xc3\xa8VES', 'BONJOUR', '\xc3\xa9L\xc3\xa9PHANTS'])

What I want is this output:

Set(['ÉLÈVES', 'BONJOUR', 'ÉLÉPHANTS'])

But I can't figure out a way to get this result. I tried many things, including putting the line `# -*- encoding: utf-8 -*-` at the top of my file. I also tried `codecs.open()`, but that didn't work either.

Thanks!

Tom
  • What is the text file encoding? Is it utf-8, a windows code page, utf-16? – tdelaney Jul 15 '20 at 19:05
  • The encoding of my text file is: utf-8 – Tom Jul 15 '20 at 19:05
  • The problem may be that Python won't *output* the string you want. This is why Python 3 uses Unicode for strings always. – Mark Ransom Jul 15 '20 at 19:15
  • @MarkRansom - it depends on how you output it. If you print the string and your `sys.stdout.encoding` supports the characters, it will print. – tdelaney Jul 15 '20 at 19:19
  • @tdelaney if you're printing Unicode strings then `sys.stdout.encoding` will be used, otherwise a byte string won't be translated. – Mark Ransom Jul 15 '20 at 19:22
  • @MarkRansom - If you read a utf-8 encoded file as a `str` and your terminal is utf-8, it will print. You get a string with utf-8 encoding, and since it's not a `unicode` string it's not decoded, but it's the format the terminal expects, so it still works. – tdelaney Jul 15 '20 at 19:28
  • @tdelaney you must not use Windows much. – Mark Ransom Jul 15 '20 at 19:31
  • @MarkRansom - the same holds for Windows code pages. Python 2 programs tended to work within the set of people using a given code page but fall over when files were shared on machines with different code pages. That's why unicode was late to the game with python and was just "bolted on" in 2.x. – tdelaney Jul 15 '20 at 19:43
  • @tdelaney I said that because it's extremely rare for a Windows console to properly display UTF-8 bytes. If your input and output character sets matched then life was grand, which is how Python 2 got away with it for so long. – Mark Ransom Jul 15 '20 at 19:49
  • Is there a way to write what's in the python set in another text file? With the capital letters and the accents? – Tom Jul 15 '20 at 20:28

3 Answers

In Python 2 you can use the `codecs` module to read the file with an encoding. Remember that the `repr` representation of a unicode string will look funky (it starts with a `u` and escapes the non-ASCII characters), but the actual string is in fact unicode.

#!/usr/bin/env python
# -*- encoding: utf-8 -*- 
from sets import Set
import codecs
with codecs.open('test.txt', encoding='utf-8') as word:
    lines = [line.strip() for line in word.readlines()]
    # since you print the list, it shows you the repr of its values
    print(lines)
    # but they really are unicode
    for line in lines:
        print(line)

The output shows the unicode repr when printing the list, but the real string when printing the strings themselves.

[u'\xe9l\xe9phants', u'bonjour', u'\xe9l\xe8ves']
éléphants
bonjour
élèves
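Once the lines are unicode, `.upper()` uses Unicode case mappings, so the accented uppercase output the question asks for just works. Here is a self-contained sketch of that (the filenames are made up for the demo, and it uses the built-in `set` rather than `sets.Set`); it also writes the set back out to a file with an encoding-aware open, which addresses the follow-up comment:

```python
# -*- coding: utf-8 -*-
import codecs
import io

# create a small sample file first so the example runs on its own
with io.open('test_unicode_demo.txt', 'w', encoding='utf-8') as f:
    f.write(u'\u00e9l\u00e9phants\nbonjour\n\u00e9l\u00e8ves\n')

# read it back as unicode and build the uppercased set
dict_word = set()  # built-in set; no need for sets.Set
with codecs.open('test_unicode_demo.txt', encoding='utf-8') as word:
    for line in word:
        # strip() removes the trailing '\n'; upper() on unicode
        # correctly maps é -> É, è -> È
        dict_word.add(line.strip().upper())

print(dict_word)

# writing back out also needs an encoding-aware open
with codecs.open('sortie_demo.txt', 'w', encoding='utf-8') as out:
    for w in sorted(dict_word):
        out.write(w + u'\n')
```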
tdelaney

The reason is probably that you read the file using the wrong encoding.

In Python 3 you would simply switch:

  • from with open('.../test_unicode.txt', 'r') as word:
  • to with open('.../test_unicode.txt', 'r', encoding="utf-8") as word:

In Python 2, it seems you can do something like this: Backporting Python 3 open(encoding="utf-8") to Python 2

I.e. use `io.open` (you have to `import io` first) and specify `encoding="utf-8"`. I would expect this to work with `codecs.open` as well, if you pass the same keyword argument.
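A minimal sketch of the `io.open` approach (the filename is invented for the demo, and the file is written first so the example is self-contained):

```python
# -*- coding: utf-8 -*-
import io

# hypothetical sample file, created here just for the demo
with io.open('mots_demo.txt', 'w', encoding='utf-8') as f:
    f.write(u'\u00e9l\u00e8ves\n')

# io.open behaves the same in Python 2 and 3: it returns unicode text
with io.open('mots_demo.txt', 'r', encoding='utf-8') as word:
    lines = word.readlines()

print(lines)  # the list repr shows escapes in Python 2, but the strings are unicode
```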

wjakobw

You can try to infer the input encoding:

import chardet
with open('.../test_unicode.txt', 'rb') as word:
    bin_data = word.read()  # detect() needs the raw bytes, not a list of lines
    enc = chardet.detect(bin_data)
    lines = bin_data.decode(enc['encoding']).splitlines()
    print(lines)
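If installing `chardet` isn't an option, a crude stdlib-only fallback is to try a list of candidate encodings in order (a sketch; the candidate list and the helper name are made up for illustration). Note that `latin-1` decodes any byte sequence without error, so it only makes sense as the last resort:

```python
# -*- coding: utf-8 -*-

def read_text(path, encodings=('utf-8', 'latin-1')):
    """Try candidate encodings in order and return unicode text."""
    with open(path, 'rb') as f:
        data = f.read()
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings fit %r' % path)

# demo: write UTF-8 bytes, then read them back as text
with open('mots_fallback.txt', 'wb') as f:
    f.write(b'\xc3\xa9l\xc3\xa8ves\n')

text = read_text('mots_fallback.txt')
print(text.strip().upper())
```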
b0lle