I want to strip all special characters from an input string read from a file, keeping only actual letters (Cyrillic letters, for example, shouldn't be stripped). The solution I found manually declares the string as Unicode and compiles the pattern with the re.UNICODE flag, so that letters from different languages are detected.
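# -*- coding: utf-8 -*-
import re
# \w already matches digits, so \d is redundant; the raw string keeps the backslash literal
pattern = re.compile(r"[^\w]", re.UNICODE)
n_uni = 'ähm whatßs äüöp ×äØü'  # plain byte string
uni = u'ähm whatßs äüöp ×äØü'  # Unicode string
words = pattern.split(n_uni)  # doesn't work
u_words = pattern.split(uni)  # works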
So if I write the string directly in the source and manually mark it as Unicode, I get the desired output, while the non-Unicode string gives me garbage:
"ähm whatßs äüöp äØü" -> unicode
"hm what s ü p ü" -> non-unicode even with some invalid characters
My question now is: how do I declare input read from a file as Unicode?
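For illustration, here is a minimal sketch of one possible approach: decode the file while reading it, so the pattern receives Unicode text rather than raw bytes. It assumes the file is UTF-8 encoded, which the question does not state. The helper name `read_words` and the temporary demo file are hypothetical and exist only to keep the example self-contained.

```python
# -*- coding: utf-8 -*-
import io
import os
import re
import tempfile

# Split on anything that is not a Unicode word character.
pattern = re.compile(r"\W", re.UNICODE)


def read_words(path, encoding="utf-8"):
    """Read a text file as Unicode and return its words.

    Hypothetical helper; the encoding is an assumption and must
    match how the file was actually saved.
    """
    # io.open decodes the bytes to Unicode text while reading.
    with io.open(path, "r", encoding=encoding) as handle:
        text = handle.read()
    # Drop the empty strings produced by adjacent separators.
    return [word for word in pattern.split(text) if word]


# Demo: write the sample string to a temporary file, then read it back.
sample = u"ähm whatßs äüöp ×äØü"
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "w", encoding="utf-8") as handle:
    handle.write(sample)

words = read_words(demo_path)
os.remove(demo_path)
print(words)
```

If the file uses a different encoding (for example cp1251 for Cyrillic text), that name would have to be passed as `encoding` instead; decoding with the wrong codec either raises an error or silently produces the wrong characters.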