Which of those encoding methods is the most reliable one?

Question

I am rather new to Python, but since my native language includes some nasty umlauts, I have to dive into the nightmare that is encoding right from the start. I read the Joel on Software article on encoding and understand the difference between code points and the actual rendering of letters (and the connection between Unicode and encodings). To get out of trouble I found three ways to deal with umlauts, but I can't decide which of them suits which situation. Could someone shed some light on this? I want to be able to write text to a file, read it back (from the file or from sqlite3) and output it, all with readable umlauts. Thanks a lot!

# -*- coding: utf-8 -*-
import codecs

# using just u + string
with open("testutf8.txt", "w") as f:
    f.write(u"Österreichs Kapitän")

with open("testutf8.txt", "r") as f:
    print f.read()


# using encode/decode
s = u'Österreichs Kapitän'
sutf8 = s.encode('UTF-8')
with open('encode_utf-8.txt', 'w') as f2:
    f2.write(sutf8)
with open('encode_utf-8.txt','r') as f2:
    print f2.read().decode('UTF-8')


# using codec
with codecs.open("testcodec.txt", "w","utf-8") as f3:
    f3.write(u"Österreichs Kapitän")

with codecs.open("testcodec.txt", "r","utf-8") as f3:
    print f3.read()

EDIT: I tested this (content of file is 'Österreichs Kapitän'):

with codecs.open("testcodec.txt", "r", "utf-8") as f3:
    s = f3.read()
    print s
    s = s.replace(u"ä", u"ü")
    print s

Do I have to use u'string' (unicode) everywhere in my code? I found that if I use a plain string (without the 'u'), replacing the umlauts does not work.

score 4 · Accepted Answer · edited May 23 '17 at 12:11

As a rule of thumb, decode an encoded string as early as possible, manipulate it as a unicode object, and encode it again as late as possible (e.g. just before writing it to a file).

For example:

with codecs.open("testcodec.txt", "r","utf-8") as f3:
    s = f3.read()

# modify s here

with codecs.open("testcodec.txt", "w","utf-8") as f3:
    f3.write(s)

As to your question of which way is best: I don't think there is a difference between using the codecs module and calling encode/decode manually. It is a matter of preference; either works.
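
To illustrate that both variants put the same bytes on disk, here is a minimal sketch. Note the assumptions: it uses io.open instead of codecs.open (my substitution, not something from the question; io.open exists since Python 2.6 and is the same function as the built-in open in Python 3), and the temporary directory and file names are made up for the example.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

text = u"Österreichs Kapitän"

# Hypothetical file names inside a throwaway directory.
tmpdir = tempfile.mkdtemp()
path_manual = os.path.join(tmpdir, "manual.txt")
path_io = os.path.join(tmpdir, "io.txt")

# Variant 1: encode by hand and write the raw bytes in binary mode.
with open(path_manual, "wb") as f:
    f.write(text.encode("utf-8"))

# Variant 2: let the file object do the encoding.
with io.open(path_io, "w", encoding="utf-8") as f:
    f.write(text)

# Read both files back as raw bytes to compare them.
with open(path_manual, "rb") as f:
    raw_manual = f.read()
with open(path_io, "rb") as f:
    raw_io = f.read()

# Decode on the way in, so the program only ever sees text.
with io.open(path_manual, "r", encoding="utf-8") as f:
    roundtrip = f.read()
```

Under these assumptions raw_manual and raw_io are identical, and roundtrip is equal to the original text again.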

Simply using open, as in your first example, does not work, because Python will then try to encode the unicode string with the default codec (ASCII, unless you changed it), which fails on the umlauts.
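
The failing step can be reproduced without a file. This is only a sketch: it performs the ASCII encoding explicitly, which is what Python 2 attempts implicitly when a unicode object is written to a file opened with plain open. The variable names are mine.

```python
# -*- coding: utf-8 -*-
text = u"Österreichs Kapitän"

failing_char = None
try:
    # ASCII has no code for umlauts, so this raises.
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    # exc.start and exc.end mark the first run of unencodable characters.
    failing_char = text[exc.start:exc.end]

# Encoding with a codec that covers the umlauts works and is reversible.
utf8_bytes = text.encode("utf-8")
restored = utf8_bytes.decode("utf-8")
```

Here ascii_ok ends up False and failing_char holds the first umlaut, while the UTF-8 round trip returns the original text.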

Regarding the question of whether you should use unicode strings everywhere: in principle, yes. In Python 2, if you create a string s = 'asdf' it has type str (you can check this with type(s)), and if you do s2 = u'asdf' it has type unicode. Since it is better to always manipulate unicode objects, the latter is recommended.
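
A small sketch of why the distinction matters for something like replace. It is written so that it runs on both Python 2 and Python 3; the variable names are only for illustration.

```python
# -*- coding: utf-8 -*-
text = u"Österreichs Kapitän"   # text: unicode in Python 2, str in Python 3
raw = text.encode("utf-8")      # bytes: str in Python 2, bytes in Python 3

# On text, replace works character by character.
replaced = text.replace(u"ä", u"ü")

# In UTF-8, a single umlaut is stored as two bytes ...
a_umlaut_bytes = u"ä".encode("utf-8")

# ... so the text and its encoded form have different lengths.
lengths = (len(text), len(raw))
```

The text has 19 characters but 21 bytes in UTF-8, so code that treats the encoded bytes as if they were characters will not match or count umlauts correctly.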

If you don't want to always have to put the 'u' in front of a string, you can use the following import:

from __future__ import unicode_literals

Then you can do s = 'asdf' and s will have the type unicode. In Python 3 this is the default, so the import is only needed in Python 2.

For potential gotchas, take a look at the question "Any gotchas using unicode_literals in Python 2.6?". Basically, you don't want to mix UTF-8 encoded byte strings and unicode strings.
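
One way to avoid mixing the two is to normalise every input to text at the boundary of your program. The helper below is a hypothetical sketch (to_text is not a standard library function), and it assumes the incoming bytes are UTF-8 unless told otherwise.

```python
# -*- coding: utf-8 -*-
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string.

    Hypothetical helper; assumes byte strings are encoded with `encoding`.
    """
    if isinstance(value, bytes):  # bytes is an alias of str in Python 2
        return value.decode(encoding)
    return value


from_file = b"Kapit\xc3\xa4n"   # e.g. raw bytes read in binary mode
from_code = u"Kapitän"          # a unicode literal in the source

# After normalising, both values can be compared and combined safely.
same = to_text(from_file) == to_text(from_code)
```

After the conversion both values are the same text, so same is True.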

Thanks a lot, that gave me some insight... does posting code not work in comments? — Mike, Jul 02 '13 at 08:10
You can click on the `help` button next to the comment field to learn about the accepted syntax (they call it mini-Markdown). Code in comments should be surrounded by backticks (`). — rkrzr, Jul 02 '13 at 08:36
Thank you. Do I have to write `u"österreich"` to be able to, e.g., replace letters? Please see my edited question for the whole example... — Mike, Jul 02 '13 at 08:43
