
I need to work with text, for example comparing words against a dictionary, and I have a problem with encoding. The txt file is UTF-8 and the code is UTF-8 too. The problem appears when I split the text into words that contain characters like š, č, ť, á, ... I tried encoding and decoding and searched the web, but I don't know what to do about it. I checked the filesystem encoding, which is mbcs, and the default encoding, which is utf-8. Can somebody help me? The code below is the first version.

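    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import re

    # Read the whole file; in Python 2 this returns an undecoded byte string.
    f = open("text.txt", "r+")
    text = f.read()
    f.close()

    # Split into sentences on ., ! or ? followed by a whitespace character.
    sentences = re.split(r"[.!?]\s", text)

    # Split the first sentence into words on whitespace.
    words = re.split(r"\s", sentences[0])

    print sentences[0]
    print words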

and the result is:

    Nexus 5 patrí v sučasnosti medzi a najlepšie aj smartfóny
    ['\xef\xbb\xbfNexus', '5', 'patr\xc3\xad', 'su\xc4\x8dasnosti', 'medzi', 'najlep\xc5\xa1ie', 'smartf\xc3\xb3ny']

When I use:

    f = codecs.open("text.txt", "r+", encoding="utf-8")

the result is:

    Nexus 5 patrí v sučasnosti medzi a najlepšie aj smartfóny
    [u'\ufeffNexus', u'5', u'patr\xed', u'su\u010dasnosti', u'medzi', u'najlep\u0161ie', u'smartf\xf3ny']

and I need output like:

    ['Nexus', '5', 'patrí', 'v', 'súčastnosti',....]
Martijn Pieters
TheBP

1 Answer


The encoding handling is correct: `u'patr\xed'` is just the representation (repr) of a unicode string in Python, and printing a list shows the repr of each item rather than the characters themselves. Try `print u'patr\xed'` in a shell to see for yourself.
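A minimal sketch of the difference between printing a list and printing its items. The sample bytes below stand in for `text.txt` and assume, as the output in the question suggests, that the file starts with a UTF-8 byte order mark; decoding with the `utf-8-sig` codec drops that mark, so the first word no longer carries `\ufeff`. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# Sample bytes standing in for text.txt (assumption: UTF-8 with a leading BOM).
raw = (b"\xef\xbb\xbfNexus 5 patr\xc3\xad v su\xc4\x8dasnosti medzi a "
       b"najlep\xc5\xa1ie aj smartf\xc3\xb3ny. Next sentence.")

# 'utf-8-sig' decodes UTF-8 and drops the BOM if one is present.
text = raw.decode("utf-8-sig")

# Take the first sentence, then split it on any run of whitespace.
words = text.split(". ")[0].split()

if __name__ == "__main__":
    # Printing a list shows each item's repr (escapes in Python 2).
    print(words)
    # Printing the strings themselves shows the actual characters.
    print(" | ".join(words))
```

The words in the list are already correct unicode strings; only their display inside a list differs, so they can be compared with other unicode strings directly.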

Having said that, since you seem to want to match the words against a dictionary, it might be useful to use the unidecode module to normalize the unicode strings to ASCII.

Elias Dorneles
  • I want to compare it with a dictionary to find a match. How do I install unidecode on Windows? There is only a Linux package. – TheBP Nov 24 '13 at 14:14
  • I think the best way is to [install pip](http://stackoverflow.com/questions/4750806/how-to-install-pip-on-windows) and then just run the command `pip install unidecode`. Unidecode is nice for exactly what you want, you can use it to normalize the dictionary words to ASCII and then later you can do the same to the word you want to look for and see if it is in your dictionary. – Elias Dorneles Nov 24 '13 at 14:50
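The lookup described in the last comment could be sketched roughly as follows. The word list and the helper names `normalize` and `find` are hypothetical. If unidecode is not installed, the sketch falls back to a standard-library stand-in that strips combining accents; that fallback is not unidecode and only covers characters that decompose into a base letter plus an accent.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import unicodedata

try:
    from unidecode import unidecode  # third-party: pip install unidecode
except ImportError:
    def unidecode(text):
        """Stdlib stand-in (assumption): drop accents after NFKD decomposition."""
        decomposed = unicodedata.normalize("NFKD", text)
        return "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))


def normalize(word):
    """Reduce a word to lowercase ASCII so accented variants compare equal."""
    return unidecode(word).lower()


# Hypothetical dictionary words, keyed by their normalized form.
dictionary_words = ["patrí", "súčasnosti", "najlepšie", "smartfóny"]
lookup = dict((normalize(w), w) for w in dictionary_words)


def find(word):
    """Return the matching dictionary word, or None if there is no match."""
    return lookup.get(normalize(word))


if __name__ == "__main__":
    print(normalize("Najlepšie"))
```

Normalizing both sides means `find("smartfony")` and `find("Smartfóny")` hit the same entry, at the cost of treating words that differ only in accents as identical.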