
I'm trying to migrate a very old MS Access database to Django/MySQL. I exported the data in CSV format using a very slow virtual machine which hosts the old database. I'm now trying to write a script that automatically inserts the data from the CSV files into MySQL via Django, but I'm running into some issues which probably stem from me not entirely understanding the encoding of this data, or encodings in general.

It's a database in which a lot of the text is in French, so it contains accented characters alongside the usual HTML-encoded quotation marks, etc. I tried the htmlparser approach from over here, but it fails with a UnicodeDecodeError on strings like Chaussée d&#39;Enghien 16. Running od -c on that string gives:

0000000    C   h   a   u   s   s 351   e       d   &   #   3   9   ;   E
0000020    n   g   h   i   e   n       1   6  \n

and od -t xC gives:

0000000    43  68  61  75  73  73  c3  a9  65  20  64  26  23  33  39  3b
0000020    45  6e  67  68  69  65  6e  20  31  36  0a

Checking the Access database again, it shows up the same there as in the octal dump here: the actual string has the é as one character and the &#39; part as five.

How can I get these kinds of strings into the right encoding? Is this possible with a magical Python function? Or is it better to export the data from the Access database in another format?

EDIT: Buried deep within Access, I found the option to export to utf-8, but no cigar.

Also, it seems that Unicode decoding only fails when the string contains both a special character and an HTML-escaped character.
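For illustration, here is a minimal sketch of one way such strings could be handled, assuming Python 3. It assumes each field arrives as raw bytes that are either UTF-8 (as the hex dump suggests) or Latin-1 (as the single byte 351 in the octal dump suggests). `decode_field` is a hypothetical helper name, not something from Django or the linked approach. The idea is to decode the bytes to text first and only then unescape the HTML entities, so that byte strings and Unicode text are never mixed.

```python
import html


def decode_field(raw: bytes) -> str:
    """Decode one CSV field to text, then resolve HTML entities.

    Assumption: the data is UTF-8 or, failing that, Latin-1.
    Latin-1 maps every byte to a character, so the fallback
    never raises, but it may be the wrong guess for other data.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("latin-1")
    # Unescape only after decoding, so entities such as &#39;
    # are replaced inside a proper text string.
    return html.unescape(text)


# The two byte sequences from the dumps above (without the newline).
latin1_bytes = b"Chauss\xe9e d&#39;Enghien 16"
utf8_bytes = b"Chauss\xc3\xa9e d&#39;Enghien 16"

print(decode_field(latin1_bytes))
print(decode_field(utf8_bytes))
```

If the export encoding is known for certain, it is probably safer to drop the fallback and decode with that single encoding, since guessing can hide genuinely corrupt data.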

thepandaatemyface
