I need to process text, for example comparing words against a dictionary, and I have a problem with encoding. The txt file is UTF-8 and the source code is UTF-8 too. The problem appears when I split the text into words that contain characters like š, č, ť, á, ... I have tried encoding and decoding and searched the web, but I don't know what to do about it. I checked the filesystem encoding, which is mbcs, and the default encoding, which is utf-8. Can somebody help me? The code below is my first version.
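#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

f = open("text.txt", "r+")
text = f.read()  # byte string (str in Python 2), not decoded
f.close()
# Split into sentences at ., ! or ? followed by whitespace
sentences = re.split(r"[.!?]\s", text)
# Split the first sentence into words on whitespace
words = re.split(r"\s", sentences[0])
print sentences[0]
print words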
and the result is:
Nexus 5 patrí v sučasnosti medzi a najlepšie aj smartfóny
['\xef\xbb\xbfNexus', '5', 'patr\xc3\xad', 'su\xc4\x8dasnosti', 'medzi', 'najlep\xc5\xa1ie', 'smartf\xc3\xb3ny']
When I use:
f = codecs.open("text.txt", "r+", encoding="utf-8")
the result is:
Nexus 5 patrí v sučasnosti medzi a najlepšie aj smartfóny
[u'\ufeffNexus', u'5', u'patr\xed', u'su\u010dasnosti', u'medzi', u'najlep\u0161ie', u'smartf\xf3ny']
but I need output like:
['Nexus', '5', 'patrí', 'v', 'súčastnosti',....]
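A minimal sketch of one way to get closer to that output, under two assumptions: the escapes such as \xc3\xad and \u010d are only how a list's repr displays the strings (the data itself is intact, as the correctly printed sentence suggests), and the leading \xef\xbb\xbf / \ufeff is a UTF-8 byte order mark at the start of the file. The sample bytes in raw are hypothetical and stand in for the contents of text.txt.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re

# Hypothetical sample standing in for text.txt: a UTF-8 BOM followed by
# part of the sentence from the question and a second, made-up sentence.
raw = b"\xef\xbb\xbfNexus 5 patr\xc3\xad v su\xc4\x8dasnosti. Next sentence."

# The "utf-8-sig" codec decodes UTF-8 and drops a leading BOM if present.
text = raw.decode("utf-8-sig")

# Same sentence split as in the question.
sentences = re.split(r"[.!?]\s", text)

# str.split() with no arguments splits on any run of whitespace.
words = sentences[0].split()

# Printing the list itself shows each element's repr (with escapes in
# Python 2); printing the strings shows the actual characters.
print(", ".join(words))
```

When reading from the real file, the same codec name can be passed to codecs.open("text.txt", "r", encoding="utf-8-sig") in place of the raw.decode call.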