How do I define strings read from a file as Unicode?

Question

Possible Duplicate:
Character reading from file in Python

I want to strip all special characters from an input string read from a file, keeping only actual letters (Cyrillic letters, for example, shouldn't be stripped). The solution I found declares the string as Unicode manually and compiles the pattern with the re.UNICODE flag, so that letters from different languages are detected.

# -*- coding: utf-8 -*-
import re
pattern = re.compile(r"[^\w\d]", re.UNICODE)
n_uni = 'ähm whatßs äüöp ×äØü'
uni = u'ähm whatßs äüöp ×äØü'
words = pattern.split(n_uni) #doesn't work
u_words = pattern.split(uni) #works

So if I write the string directly in the source and manually define it as Unicode it gives me the desired output while the non-Unicode string gives me just garbage:

"ähm whatßs äüöp äØü" -> unicode
"hm what s ü p ü" -> non-unicode even with some invalid characters

My question is now how do I define the input from a file as Unicode?

Seriously. Searching for "python read unicode file" on Google gives you the relevant documentation as the first hit. And the duplicate StackOverflow question as hit #2. — Tomalak, Jul 01 '12 at 12:11

score 2 · Accepted Answer · answered Jul 01 '12 at 12:09

My question is now how do I define the input from a file as unicode?

Straight from the docs.

import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print repr(line)
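
To connect this back to the question, here is a minimal sketch that reads a file as Unicode and applies the same pattern to each line. It assumes the file is UTF-8 encoded and uses io.open, which behaves the same on Python 2.6+ and Python 3. The helper name split_words and the temporary sample file are hypothetical and only there to make the example self-contained.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import io
import os
import re
import tempfile

# Same pattern as in the question: split on anything that is not a word character.
pattern = re.compile(r"[^\w\d]", re.UNICODE)


def split_words(path, encoding="utf-8"):
    """Read a text file as Unicode and return its words (hypothetical helper)."""
    words = []
    # io.open decodes each line to Unicode text using the given encoding.
    with io.open(path, encoding=encoding) as f:
        for line in f:
            # Drop the empty strings produced by adjacent separators.
            words.extend(w for w in pattern.split(line) if w)
    return words


# Write a small UTF-8 sample file so the sketch is self-contained.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(sample_path, "w", encoding="utf-8") as f:
    f.write("ähm whatßs äüöp ×äØü\n")

result = split_words(sample_path)
os.remove(sample_path)
print(repr(result))
```

With the sample line from the question, result should hold the four words ähm, whatßs, äüöp and äØü. The × sign is not a letter, so it acts as a separator. If the file is not UTF-8, the encoding argument has to match its actual encoding.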

Tomalak

Works now; it was a problem with my setup. – Zibi92 Jul 05 '12 at 11:03
