Curious about unicode / string encoding in Python 3

Question

I'd like to ask why something works which I have found after painful hours of reading/trying to understand and in the end simply succesfull trial-and-error...

I'm on Linux (Ubuntu 13.04, German time formats etc., but english system language). My small python 3 script connects to a sqlite3 database of the reference manager Zotero. There I read a couple of keys with the goal of exporting files from the zotero storage directory (probably not important, and as said above, got it working).

All of this works fine with characters in the ascii set, but of course there are a lot of international authors in the database and my code used to fail on non-ascii authors/paper titles. Perhaps first some info about the database on command line sqlite3:

sqlite3 zotero-test.sqlite 
SQLite version 3.7.15.2 2013-01-09 11:53:05
sqlite> PRAGMA encoding;
UTF-8

Exemplary problematic entry:

sqlite> select * from itemattachments;   
317|281|1|application/pdf|5|storage:MÃ¼ller-Forell_2008_Orbitatumoren.pdf||2|1372357574000|2814ef3ea9c50cce2c32d6fb46b977bb

The correct name would be "storage:Müller-Forrell"; Zotero itself decodes this correctly, but SQLIte does not (at least dos not output it correctly in my terminal).

Google tells me that "Ã¼" is a somehow incorrectly or not decoded latin-1/8859-1 "ü".

Reading this database entry in from python3 with

connection = sqlite3.connect("zotero-test.sqlite")`
cursor = connection.cursor()`
cursor.execute("SELECT itemattachments.itemID,itemattachments.sourceItemID,itemattachments.path,items.key FROM itemattachments,items WHERE mimetype=\"application/pdf\" AND items.itemID=itemattachments.itemID")

for pdf_result in cursor:
    print(pdf_result[2])
    print()
    print(pdf_result[2].encode("latin-1").decode("utf-8"))

gives:

storage:MÃ¼ller-Forell_2008_Orbitatumoren.pdf

storage:Müller-Forell_2008_Orbitatumoren.pdf

, the second being correct, so I got my script working (gosh how many hours this cost me...)

Can somebody explain to me what this construction of .encode and .decode does? Which one is even executed first? Thanks for any clues,

Joost

Even with the solutions listed below, there are some gotchas. Namely, the decoding of the filename into utf-8 may print the same, but still be distinct from, the filename as it is stored in the file system. To work around this, working in python3, I applied the following function before comparing strings returned by the database to strings returned by the file system: `import unicodedata`, then canonicalize string s using `unicodedata.normalize('NFKD', s)`. See http://stackoverflow.com/questions/14682397/can-somone-explain-how-unicodedata-normalizeform-unistr-work-with-examples. — Neal Young, Oct 24 '16 at 00:52

score 2 · Answer 1 · answered Jul 03 '13 at 20:25

2

The cursor yields strs. We run encode() on it to convert it to a bytes, and then decode it back into a str. It sounds like the data in the database is misencoded.

answered Jul 03 '13 at 20:25

Ignacio Vazquez-Abrams

776,304
153
1,341
1,358

Hm, ok, print(pdf_result[2].encode("latin-1")) gives me b'storage:M\xc3\xbcller-Forell_2008_Orbitatumoren.pdf' , so this would be a byte string with utf-8 encoding, right? (\xc3\xbc signifies "ü"). BUt I encoded it to latin-1, or did I not? In latin-1 "ü" should be decimal 252. Still don't get it... – Joost Jul 03 '13 at 20:32
1

The strings were converted from Latin-1 to UTF-8 *two* times when they were written into the database. – CL. Jul 03 '13 at 20:43
Thanks for your help in clarifying things for me! J – Joost Jul 03 '13 at 21:23

score 2 · Answer 2 · answered Jul 03 '13 at 20:30

2

What you're seeing here is UTF8 data encoded in latin-1 stored in the SQLite database.

The sqlite module always returns unicode strings, so you first have to encode them into a unicode equivalent of latin-1 and then decode them as UTF8.

They shouldn't have been stored in the db as latin-1 to begin with.

You are executing encode before decode.

answered Jul 03 '13 at 20:30

OregonTrail

8,594
7
43
58

Thanks, I guess I get a grasp of it now! I wonder how I could have noticed this in the beginning, probably experienced programmers would have found out about this on first sight. Thanks again, Joost – Joost Jul 03 '13 at 21:22

Curious about unicode / string encoding in Python 3

2 Answers2