I read data from a file and split it into words:
words = re.findall(r'[\w]+',self._from.encode('utf8'),re.U)
If the file contains:
Hi, how are you?
Then the result will be:
['Hi', 'how', 'are', 'you']
But if the file contains Russian text (i.e. Cyrillic characters), for example:
Привет, как дела?
In this case the result is:
['\xd0', '\xd1', '\xd0', '\xd0\xb2\xd0\xb5\xd1', '\xd0\xba\xd0', '\xd0\xba', '\xd0', '\xd0\xb5\xd0', '\xd0']
Why does this happen? I've already added:
sys.setdefaultencoding('utf-8')
I'm using Python 2.7 on Ubuntu Linux.
Answer:
words = re.findall(r'[\w]+',self._from.decode('utf8'),re.U)
print u" ".join(words)
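
The reason decode works where encode does not: encode('utf8') hands re.findall a byte string, so \w is tested against individual bytes rather than whole characters, and each two-byte Cyrillic letter gets split apart. sys.setdefaultencoding does not change how re treats a byte string. Decoding first gives re a unicode string, and with re.U the pattern then matches Cyrillic letters as word characters. Below is a minimal sketch of that idea; the helper name split_words and the sample input are my own for illustration and are not from the original code, and I'm assuming the file content is UTF-8 encoded bytes.

```python
# -*- coding: utf-8 -*-
import re


def split_words(raw_bytes, encoding="utf-8"):
    """Decode raw bytes to text first, then extract word tokens.

    split_words is a hypothetical helper, not part of the original code.
    """
    text = raw_bytes.decode(encoding)  # bytes -> unicode text
    # re.U makes \w match Unicode letters, not just ASCII ones
    return re.findall(r"\w+", text, re.U)


# Sample input standing in for the bytes read from the file
raw = u"Привет, как дела?".encode("utf-8")
words = split_words(raw)
print(len(words))
```

As in the answer, u" ".join(words) turns the list back into a single string for printing. If the data is already a unicode object (for example, read with codecs.open or io.open and an explicit encoding), the decode step is not needed.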