Wrong text encoding in Python

Question

I need to work with text, for example comparing words against a dictionary, and I have a problem with encoding. The txt file is UTF-8, and the code is UTF-8 too. The problem appears when splitting the text into words that contain characters like š, č, ť, á, ... I tried encoding and decoding and searched the web, but I don't know what to do about it. I checked the filesystem encoding, which is mbcs, and the default encoding, which is utf-8. Can somebody help me? The code below is the first version.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import re

    f = open("text.txt", "r+")

    text = f.read()

    sentences = re.split(r"[.!?]\s", text)

    words = re.split(r"\s", sentences[0])

    print sentences[0]
    print words

and the result is:

Nexus 5 patrí v sučasnosti medzi a najlepšie aj smartfóny

['\xef\xbb\xbfNexus', '5', 'patr\xc3\xad', 'su\xc4\x8dasnosti', 'medzi', 'najlep\xc5\xa1ie', 'smartf\xc3\xb3ny']

When I use:

    f = codecs.open("text.txt", "r+", encoding="utf-8")

the result is:

Nexus 5 patrí v sučasnosti medzi a najlepšie aj smartfóny

[u'\ufeffNexus', u'5', u'patr\xed', u'su\u010dasnosti', u'medzi', u'najlep\u0161ie', u'smartf\xf3ny']

and I need output like:

['Nexus', '5', 'patrí', 'v', 'súčastnosti',....]

You have unicode strings in a list. If you don't want to print representations, don't print the list container but each element separately. – Martijn Pieters, Nov 24 '13 at 13:56
OK, now I see. But when I compare each element of the list with a dictionary to find a match, will it work fine? – TheBP, Nov 24 '13 at 14:12

score 1 · Accepted Answer · answered Nov 24 '13 at 13:57

The encoding handling is correct; `u'patr\xed'` is just the representation (repr) of a unicode string in Python. Try `print u'patr\xed'` in a shell to see for yourself.
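To illustrate the difference between printing the list and printing its elements, here is a minimal sketch. It assumes a made-up sample sentence written to a temporary file (standing in for `text.txt`), and it opens the file with the `utf-8-sig` codec, which decodes UTF-8 and drops a leading byte order mark so that `\ufeff` does not end up glued to the first word. It is written with `u""` literals and single-argument `print(...)` so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import re
import tempfile

# Hypothetical sample standing in for text.txt, saved as UTF-8 with a BOM.
sample = u"Nexus 5 patrí medzi najlepšie smartfóny. Druhá veta."
path = os.path.join(tempfile.mkdtemp(), "text.txt")
with io.open(path, "w", encoding="utf-8-sig") as f:
    f.write(sample)

# "utf-8-sig" decodes UTF-8 and skips a leading BOM if one is present.
with io.open(path, "r", encoding="utf-8-sig") as f:
    text = f.read()

sentences = re.split(r"[.!?]\s", text)
words = sentences[0].split()

# Printing the list shows the repr of each element
# (escapes such as u'patr\xed' on Python 2, plain characters on Python 3).
print(words)

# Printing each element shows the actual characters on both versions.
for word in words:
    print(word)

# The strings themselves are intact, so comparisons work as expected.
print(words[2] == u"patrí")
```

The escapes in the list output are only a display detail; the comparison on the last line works on the decoded text either way.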

Having said that, since you want to look the words up in a dictionary, it might be useful to use the unidecode module to normalize the unicode strings to ASCII, so that the dictionary entries and the words from the text are compared in the same form.
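As a rough sketch of that normalize-then-compare idea, the example below uses only the standard library's `unicodedata` module instead of unidecode, so it runs without installing anything. This is a stand-in, not unidecode itself: it only strips combining accents, whereas `unidecode.unidecode(word)` transliterates a much wider range of characters. The helper name `strip_accents` and the sample dictionary are hypothetical.

```python
# -*- coding: utf-8 -*-
import unicodedata


def strip_accents(word):
    """Hypothetical helper: remove combining accents and lowercase the word.

    Stand-in for unidecode; characters without a decomposition are kept as-is.
    """
    decomposed = unicodedata.normalize("NFKD", word)
    stripped = u"".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


# Hypothetical dictionary; normalize its entries once, up front.
dictionary = set(strip_accents(w) for w in [u"patrí", u"najlepšie", u"smartfóny"])

words = [u"Nexus", u"5", u"patrí", u"medzi", u"najlepšie", u"smartfóny"]

# Normalize each word the same way before looking it up.
matches = [w for w in words if strip_accents(w) in dictionary]

for word in matches:
    print(word)
```

If accents carry meaning in your dictionary, you can skip the folding and compare the unicode strings directly; the important part is that both sides are decoded text in the same form.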

Elias Dorneles

I want to compare it with a dictionary to find a match. How do I install unidecode on Windows? There is only a Linux package. – TheBP Nov 24 '13 at 14:14
I think the best way is to [install pip](http://stackoverflow.com/questions/4750806/how-to-install-pip-on-windows) and then just run the command `pip install unidecode`. Unidecode is nice for exactly what you want: you can use it to normalize the dictionary words to ASCII, then do the same to the word you want to look up and check whether it is in your dictionary. – Elias Dorneles Nov 24 '13 at 14:50
