I am rather new to Python, but since my native language includes some nasty umlauts, I have to dive into the nightmare that is encoding right at the start. I read the Joel on Software article on encoding and understand the difference between code points and the actual rendering of letters (and the connection between Unicode and encodings). To get out of trouble I found three ways to deal with umlauts, but I can't decide which of them suits which situation. Could someone shed some light on this? I want to be able to write text to a file, read it back (from the file or from sqlite3), and output it, all with readable umlauts. Thanks a lot!
# -*- coding: utf-8 -*-
import codecs

# using just u + string
with open("testutf8.txt", "w") as f:
    f.write(u"Österreichs Kapitän")
with open("testutf8.txt", "r") as f:
    print f.read()

# using encode/decode
s = u'Österreichs Kapitän'
sutf8 = s.encode('UTF-8')
with open('encode_utf-8.txt', 'w') as f2:
    f2.write(sutf8)
with open('encode_utf-8.txt', 'r') as f2:
    print f2.read().decode('UTF-8')

# using codec
with codecs.open("testcodec.txt", "w", "utf-8") as f3:
    f3.write(u"Österreichs Kapitän")
with codecs.open("testcodec.txt", "r", "utf-8") as f3:
    print f3.read()
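For the sqlite3 part, the round trip I have in mind looks roughly like the following. This is only a minimal sketch: the in-memory database and the table and column names (`people`, `name`) are made up for illustration, and I am assuming that sqlite3 returns TEXT columns as unicode text by default.

```python
# -*- coding: utf-8 -*-
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")

# Pass the unicode string as a bound parameter instead of building the SQL by hand.
conn.execute("INSERT INTO people (name) VALUES (?)", (u"Österreichs Kapitän",))
conn.commit()

# Read the value back; with the default settings it should come back as text, not bytes.
row = conn.execute("SELECT name FROM people").fetchone()
name = row[0]
conn.close()
```

If this is right, `name` should compare equal to the original unicode string, umlauts included.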
EDIT: I tested this (content of file is 'Österreichs Kapitän'):
with codecs.open("testcodec.txt", "r", "utf-8") as f3:
    s = f3.read()
    print s
    s = s.replace(u"ä", u"ü")
    print s
Do I have to use u'string' (unicode) everywhere in my code? I found that if I use a plain string literal (without the 'u' prefix), replacing the umlauts doesn't work...
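To make the comparison concrete, here is a minimal sketch of the two variants as I understand them. The variable names are only for illustration, and my assumption (which may not be the whole story) is that the trouble comes from mixing unicode text with byte strings: text needs text patterns, and bytes need byte patterns.

```python
# -*- coding: utf-8 -*-
text = u"Österreichs Kapitän"   # unicode text
raw = text.encode("utf-8")      # the same content as UTF-8 encoded bytes

# On unicode text, replace with unicode patterns: this works per character.
replaced_text = text.replace(u"ä", u"ü")

# On bytes, the patterns have to be bytes as well; "ä" is two bytes in UTF-8.
old_bytes = u"ä".encode("utf-8")
new_bytes = u"ü".encode("utf-8")
replaced_raw = raw.replace(old_bytes, new_bytes)
```

Both variants give the same result once `replaced_raw` is decoded again, so the question is really whether I should keep everything as unicode inside the program and only encode and decode at the edges.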