Possible Duplicate:
Character reading from file in Python

I want to strip all special characters from an input string read from a file, keeping only actual letters (Cyrillic letters, for example, should not be stripped). The solution I found manually declares the string as Unicode and compiles the pattern with the re.UNICODE flag, so that letters from different languages are recognized.

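# -*- coding: utf-8 -*-
import re

# Raw string so the backslashes reach the regex engine unchanged
pattern = re.compile(r"[^\w\d]", re.UNICODE)
n_uni = 'ähm whatßs äüöp ×äØü'   # plain (byte) string
uni = u'ähm whatßs äüöp ×äØü'    # unicode string
words = pattern.split(n_uni)     # doesn't work
u_words = pattern.split(uni)     # works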

So if I write the string directly in the source and manually define it as Unicode, I get the desired output, while the non-Unicode string just gives me garbage:

"ähm whatßs äüöp äØü" -> unicode
"hm what s ü p ü" -> non-unicode even with some invalid characters

My question is now how do I define the input from a file as Unicode?

Community
Zibi92
    Seriously. Searching for "python read unicode file" on Google gives you the relevant documentation as the first hit. And the duplicate StackOverflow question as hit #2. – Tomalak Jul 01 '12 at 12:11

1 Answer

My question is now how do I define the input from a file as unicode?

Straight from the docs.

import codecs

# codecs.open decodes the file as UTF-8, so every line is a unicode string
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print(repr(line))
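
To connect this back to the regex in the question, here is a minimal sketch that reads a file as Unicode and then splits it with the same pattern. It uses io.open, which accepts an encoding argument much like codecs.open does. The helper name split_words, the file name sample.txt and the UTF-8 encoding are assumptions made for illustration; use the encoding your file is actually saved in.

```python
# -*- coding: utf-8 -*-
import io
import os
import re
import tempfile

# Same pattern as in the question: split on anything that is not a word character.
pattern = re.compile(r"[^\w\d]", re.UNICODE)


def split_words(path, encoding="utf-8"):
    """Read a text file as Unicode and return its words (hypothetical helper)."""
    # io.open decodes the bytes to unicode text while reading.
    with io.open(path, encoding=encoding) as f:
        text = f.read()
    # Drop the empty strings produced by consecutive separators.
    return [w for w in pattern.split(text) if w]


# Write a small UTF-8 sample file (assumed name and encoding), then read it back.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(sample_path, "w", encoding="utf-8") as f:
    f.write(u"ähm whatßs äüöp ×äØü")

words = split_words(sample_path)
print(words)
```

Because the text is decoded while it is read, the pattern sees unicode strings, so the words from the file come out the same way as the u'...' literal in the question: ähm, whatßs, äüöp, äØü.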
Tomalak