I'm trying to migrate a very old MS Access database to Django/MySQL. I exported the data in CSV format using a very slow virtual machine that hosts the old database. I'm now trying to write a script that automatically inserts the data from the CSV files into MySQL via Django, but I'm running into issues that probably stem from me not fully understanding this particular encoding, or encodings in general.
It's a database in which a lot of the text is in French, so it contains accented characters alongside the usual HTML-encoded quotation marks, etc. I was trying to just use the htmlparser approach from over here, but it fails with a UnicodeDecodeError on strings like Chaussée d&#39;Enghien 16, which, if you run it through od -c, gives:
0000000 C h a u s s 351 e d & # 3 9 ; E
0000020 n g h i e n 1 6 \n
and od -t xC gives:
0000000 43 68 61 75 73 73 c3 a9 65 20 64 26 23 33 39 3b
0000020 45 6e 67 68 69 65 6e 20 31 36 0a
Checking the Access database again, it shows up the same there as in the octal dump here: the actual string has the é as one character and the &#39; part as five.
How can I get these kinds of strings into the right encoding? Is this possible with some magical Python function? Or is it better to export the data from the Access database in another format?
EDIT: Buried deep within Access, I found the option to export to utf-8, but no cigar.
Also, it seems that the Unicode decoding only fails when the string contains both a special (non-ASCII) character and an HTML-escaped character.
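To make that last observation concrete, here is a minimal sketch of the order of operations I suspect matters: decode the raw bytes to text first, and only then unescape the HTML entities. It is written for Python 3 and uses the standard library's html.unescape rather than the HTMLParser approach mentioned above; decode_field is a hypothetical helper name, and the UTF-8 default is an assumption based on the hex dump (the od -c dump looks like Latin-1 instead, so the encoding is a parameter).

```python
import html


def decode_field(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: turn one raw CSV field into clean text."""
    # Step 1: bytes -> text. The encoding is an assumption and must match
    # how the CSV was actually exported (UTF-8 vs. Latin-1 here).
    text = raw.decode(encoding)
    # Step 2: resolve HTML entities such as &#39; on the decoded text,
    # so no byte string is ever mixed with Unicode replacements.
    return html.unescape(text)


# The bytes from the hex dump above (UTF-8 export, trailing newline dropped).
utf8_bytes = b"Chauss\xc3\xa9e d&#39;Enghien 16"
# The same string as the octal dump shows it (single byte 351 = 0xE9).
latin1_bytes = b"Chauss\xe9e d&#39;Enghien 16"

print(decode_field(utf8_bytes))
print(decode_field(latin1_bytes, encoding="latin-1"))
```

If this is right, it would explain why only strings with both an accented character and an entity fail: unescaping a still-undecoded byte string forces an implicit ASCII decode as soon as an entity gets replaced.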