
I use the following Python code to insert a row into a MySQL table:

city = City()
city.country_id = connection.globe.session.query(Country).\
    filter(Country.code == row[1]).one().id
city.name = row[3].decode('latin1').encode('utf8')
city.province = row[2].decode('latin1').encode('utf8')
[city.latitude, city.longitude] = [row[5], row[6]]

connection.globe.session.add(city)
connection.globe.session.commit()

When testing on my local machine, an example row is inserted correctly:

75,209,36,Radès,36.7681,10.2753

Using the same code from a different machine (AWS) results in a slightly different row:

75,209,36,RadÃ¨s,36.7681,10.2753

The entire MySQL database is configured to use utf8mb4 encoding, and I spent a significant amount of time believing that MySQL was to blame for the encoding error. But since I've been running the code on different machines, I've noticed that the same code works on one machine but not another.

Since the same code is being executed, I don't believe my Python code is to blame.

Is there something strange with Linux and character sets / character encoding that I'm missing here?

EDIT: I should note that they are connecting to the same RDS Database, meaning that the database is consistent between both inserts.

  • Where does `row` come from? – Alastair McCormack Jun 09 '17 at 17:28
  • Is this Python 2 or 3? – Alastair McCormack Jun 09 '17 at 17:35
  • It's a CSV reader where the file encoding is latin1, `for row in reader[245:]:` It's Python 2.7. – Anders Hokinson Jun 09 '17 at 17:39
  • And I guess I should further clarify that the code I provided is a snippet of a much larger API application I'm working on. I can provide the entire locations.py file if it might help. – Anders Hokinson Jun 09 '17 at 17:48
  • See my Python Unicode primer here: https://stackoverflow.com/a/35444608/1554386 - Use io.open() with backports CSV and use Unicode with MySQL. The only reason why encoding may differ from each machine is either you're accidentally running on Python 3 or you're not copying the file correctly. – Alastair McCormack Jun 09 '17 at 17:56
  • I will try to look over your primer for the answer, but it's definitely not running Python 3. `bos-mp469:globe-api ahokinso$ python --version` → `Python 2.7.12`; `ubuntu@ip-xx-xx-xx-xxx:~$ python --version` → `Python 2.7.12` – Anders Hokinson Jun 09 '17 at 18:10
  • Drop a debug in your code. Do `print repr('row[3]')` and compare the difference – Alastair McCormack Jun 09 '17 at 18:13
  • Without making any changes, I tried to use your debug suggestion, and the example returned the same value for `repr(row[3])` on each machine which was `'Rad\xe8s'`. While I will definitely try to use io to open as well as the backported CSV, I think the problem is further down the line? – Anders Hokinson Jun 09 '17 at 18:29
  • Also tried print both `row[3]` and `city.name`, which returned the following on both machines: `'Rad\xe8s'` `'Rad\xc3\xa8s'` – Anders Hokinson Jun 09 '17 at 18:35
  • Oh, that's interesting! How did you print the results given in the question? – Alastair McCormack Jun 09 '17 at 19:22
  • I can't really say what the answer is, but through lots of print repr() calls, I was able to determine that .encode() was the failure point. I have no idea why, but inserting unicode representations directly fixed my issue. It might only work because I'm using utf8_unicode_ci as my character collation though? Thanks for all your help pointing me in the right direction @AlastairMcCormack ! – Anders Hokinson Jun 09 '17 at 19:25
  • This tipped me off:
    ```
    string = "Meknès"
    string1 = string.decode('latin1')
    print repr(string1)  # u'Mekn\xc3\xa8s'
    string2 = string1.encode('utf8')
    print repr(string2)  # 'Mekn\xc3\x83\xc2\xa8s'
    ```
    – Anders Hokinson Jun 09 '17 at 19:26
  • Good stuff. I'm glad you've got it working – Alastair McCormack Jun 09 '17 at 19:29

1 Answer


If everything is configured correctly, there is no need to decode/encode character strings manually.

Hex E8 is the latin1 encoding of è; hex C3A8 is its utf8 encoding.
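Those two byte sequences can be verified directly (Python 3 syntax here; the question itself runs Python 2, where `str` is already a byte string):

```python
# The character è (U+00E8) from "Radès", in each encoding.
ch = "\u00e8"  # è

latin1_bytes = ch.encode("latin-1")
utf8_bytes = ch.encode("utf-8")

print(latin1_bytes)  # b'\xe8'
print(utf8_bytes)    # b'\xc3\xa8'
```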

If the data coming in is latin1, declare it as such, then let MySQL convert as you INSERT into the table:

  • LOAD DATA (assuming that is what you are using to read the .csv file) has a CHARACTER SET clause.
  • The client needs to specify its encoding when connecting to the database.
  • The table needs to specify the encoding on the column.

More discussion (see, especially, "Mojibake"): http://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored
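The double-encoding the comments demonstrate with "Meknès" can be reproduced without MySQL at all: if bytes are already UTF-8, decoding them as latin1 and re-encoding as utf8 encodes them twice. A minimal sketch (Python 3 syntax; in the question's Python 2 the same thing happens with byte strings):

```python
raw = b"Rad\xc3\xa8s"  # bytes that are in fact already UTF-8

# Wrong: treat UTF-8 bytes as latin1, then encode again (double encoding).
mojibake = raw.decode("latin-1").encode("utf-8")
print(mojibake)  # b'Rad\xc3\x83\xc2\xa8s' -> displays as "RadÃ¨s"

# Right: decode once with the real encoding and keep the text as unicode.
text = raw.decode("utf-8")
print(text)  # Radès
```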

Python notes: http://mysql.rjweb.org/doc.php/charcoll#python
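A hedged sketch of the decode-once approach the comments point toward (the file name, column layout, and helper name are assumptions based on the question; Python 3 syntax, where `csv` handles text natively):

```python
import csv
import io

def read_rows(path):
    # Decode latin-1 exactly once, at the file boundary; everything after
    # this point is unicode text, so no .decode()/.encode() calls are needed.
    with io.open(path, encoding="latin-1", newline="") as f:
        for row in csv.reader(f):
            yield row

# Each yielded field (e.g. row[3] == "Radès") can then be assigned to the
# model directly; the database driver encodes it per the connection charset.
```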

Rick James