10

I am reading data from a file that contains words with French and English letters. I am trying to build a list of all the English and French letters that occur (stored as strings). I do this with the code below:

# encoding: utf-8
import urllib2

def trackLetter(letters, line):
    for a in line:
        found = False
        for b in letters:
            if b==a:
                found = True
        if not found:
            letters += a

cur_letters = [] # for storing possible letters

data = urllib2.urlopen('https://duolinguist.wordpress.com/2015/01/06/top-5000-words-in-french-wordlist/', 'utf-8')
for line in data:
    trackLetter(cur_letters, line)
    # works if I print here

print cur_letters

This code prints the following:

['t', 'h', 'e', 'o', 'f', 'a', 'n', 'd', 'i', 'r', 's', 'b', 'y', 'w', 'u', 'm', 'l', 'v', 'c', 'p', 'g', 'k', 'x', 'j', 'z', 'q', '\xc3', '\xa0', '\xaa', '\xb9', '\xa9', '\xa8', '\xb4', '\xae', '-', '\xe2', '\x80', '\x99', '\xa2', '\xa7', '\xbb', '\xaf']

Obviously the French letters have been lost in some sort of conversion to ASCII, despite my specifying the UTF-8 encoding! The strange thing is that when I print each line directly (at the spot marked by the comment), the French characters appear perfectly!

What should I do to preserve these characters (é, è, ê, etc.), or convert them back to their original version?

David Ferris
  • 1
    Possible duplicate of [Unicode (utf8) reading and writing to files in python](http://stackoverflow.com/questions/491921/unicode-utf8-reading-and-writing-to-files-in-python) – mx0 Nov 24 '16 at 20:27
  • 3
    No, reading the file isn't the issue - see the OP's "works if I print here" comment – Greg Nov 24 '16 at 20:39

2 Answers

7

They aren't lost, they're just escaped when you print the list.

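When you print a list in Python 2, print calls the __str__ method of the list itself, and that method builds its output from the repr() of each item rather than from each item's own __str__. The repr() of a byte string escapes every non-ASCII byte, which is why you see sequences like '\xc3' and '\xa9'. See this excellent answer for more explanation: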

How does str(list) work?

The following snippet demonstrates the issue succinctly:

char_list = ['é', 'è', 'ê']
print(char_list)
# ['\xc3\xa9', '\xc3\xa8', '\xc3\xaa']

print(', '.join(char_list))
# é, è, ê
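In the question's output, each accented letter shows up as two separate entries ('\xc3' and '\xa9', for example) because iterating over a UTF-8 byte string yields single bytes, not whole characters. A minimal sketch of one way around that, assuming the input really is UTF-8: decode each line to Unicode before collecting characters. The helper name `track_letters` and the sample `raw_lines` are made up for illustration and stand in for the lines read from the URL; the sketch is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def track_letters(letters, line):
    """Append each non-whitespace character of `line` not already in `letters`.

    Hypothetical helper; `line` is expected to be Unicode text, not bytes.
    """
    for ch in line:
        if not ch.isspace() and ch not in letters:
            letters.append(ch)


# Sample UTF-8 encoded lines (assumption: stand-ins for the downloaded data).
raw_lines = [b'caf\xc3\xa9\n', b'\xc3\xaatre\n']

letters = []
for raw in raw_lines:
    # Decode bytes to Unicode first, so 'é' is one character instead of two bytes.
    track_letters(letters, raw.decode('utf-8'))

# Joining the items avoids the escaped repr() that printing the list would show.
joined = ', '.join(letters)
```

Printing `joined` on a UTF-8 terminal should show the accented letters unescaped, whereas printing `letters` directly in Python 2 would still show escapes such as `u'\xe9'`, since the list falls back to `repr()` for its items.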
Greg
  • That's definitely helpful, although it doesn't seem to fix my issue. Your code works perfectly for me, but for some reason when I call `print(''.join(cur_letters))` at the end of my code it gives me the error `[Decode error - output not utf-8]` – David Ferris Nov 24 '16 at 21:02
  • This error is even thrown in my `trackLetter()` function if I call `print type(a)` on the french characters – David Ferris Nov 24 '16 at 21:05
  • Ah.. does it solve your problem if you open the file via `codecs.open("words.txt", "r", "utf-8")`? – Greg Nov 24 '16 at 21:13
  • I simplified the problem in my original post for clarity - I am actually reading lines off a website (see edited post). – David Ferris Nov 24 '16 at 21:20
-1

Not an ideal answer, but as a workaround the French characters can also be added manually:

french_letters = ['é',
        'à', 'è', 'ù',
        'â', 'ê', 'î', 'ô', 'û',
        'ç',
        'ë', 'ï', 'ü']

all_letters = cur_letters + french_letters
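Plain concatenation can leave duplicates if `cur_letters` already contains some accented letters. A rough sketch of an order-preserving merge that skips repeats, assuming all items are Unicode strings; `merge_unique` is a hypothetical helper and the short `cur_letters` below is sample data, not the real result from the question.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def merge_unique(*groups):
    """Concatenate the given lists, keeping only the first occurrence of each item."""
    seen = set()
    merged = []
    for group in groups:
        for ch in group:
            if ch not in seen:
                seen.add(ch)
                merged.append(ch)
    return merged


cur_letters = ['t', 'h', 'e', 'é']  # sample data (assumption)
french_letters = ['é',
        'à', 'è', 'ù',
        'â', 'ê', 'î', 'ô', 'û',
        'ç',
        'ë', 'ï', 'ü']

# 'é' appears in both lists but ends up in the result only once.
all_letters = merge_unique(cur_letters, french_letters)
```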
David Ferris