For the query:

SHOW VARIABLES LIKE 'char%';

MySQL Database returns:

character_set_client    latin1
character_set_connection    latin1
character_set_database  latin1
character_set_filesystem    binary
character_set_results   latin1
character_set_server    latin1
character_set_system    utf8
character_sets_dir  /usr/local/mysql-5.7.27-macos10.14-x86_64/share/charsets/

In my Python script:

import pyodbc

conn = get_database_connection()  # helper that returns a pyodbc connection
conn.setdecoding(pyodbc.SQL_CHAR, encoding='latin1')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='latin1')

For one of the columns that has the following value:

N’a pas

Python returns:

N?a pas

Between the N and the a, there is a star-shaped question mark. How do I read it as-is? What's the best way to handle it? I have been reading about converting my database to UTF-8, but that seems like a long shot with a good chance of breaking other things. Is there a more efficient way to do it?

At some places in my code, I have done:

value = value.encode('utf-8', 'ignore').decode('utf-8')

to handle UTF-8 data such as accented characters, but the apostrophe was not handled the same way and I ended up with ? instead of '.
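
For what it's worth, a quick sketch (the value below is a placeholder) shows that this round trip is a no-op on a string that has already been decoded, so it cannot bring the lost apostrophe back:

value = "N?a pas"   # what the latin1-decoded connection hands back (placeholder)
assert value.encode('utf-8', 'ignore').decode('utf-8') == value   # round trip returns value unchanged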

systemdebt
  • (1) The "fancy" apostrophe `’` (right single quotation mark, U+2019) is not part of Latin-1. Upgrading to UTF-8 is definitely the best option. It's 2020 now, UTF-8 is everywhere. (2) There are very rare cases where `value.encode('utf8', 'ignore').decode('utf8')` has an effect. Typographic quotes are none of them. 99.9% of the time, this expression returns the original `value` unchanged. – lenz Apr 10 '20 at 05:09
  • @lenz - UTF-8 would be better. However, the comment is incorrect. Hex 92 is the latin1 encoding of 'RIGHT SINGLE QUOTATION MARK'. – Rick James Apr 11 '20 at 04:11
  • @RickJames It depends how you define "Latin-1". Code point 0x92 is a control character in standard Latin-1 (ISO-8859-1). It is a quotation mark in the Windows codepage 1252 (among others), which is a modification of the former and which is colloquially referred to as "Windows Latin 1". I don't know how MySQL defines "Latin-1"; I wouldn't be surprised if it's the latter. – lenz Apr 11 '20 at 14:18
  • @lenz - I think that MySQL's latin1 does nothing to validate bytes that it receives. Utf8, on the other hand, squawks at virtually any latin1 string with an 8-bit character, including the 92 in question. – Rick James Apr 11 '20 at 22:39
  • @RickJames This is also true. There's no good way to validate any 8-bit encoding, unless you know what the interpreted string should look like. – lenz Apr 12 '20 at 11:27
  • @lenz - "validate" -- Do you mean "see if there are any utf-8 characters in the string"? Or "verify that all 8-bit codes are valid utf-8 codes"? Or something else? Meanwhile, I think I have seen some techniques with some libraries (don't know about Python). – Rick James Apr 12 '20 at 18:14
  • @RickJames Given a sequence of bytes, you can check if it is valid UTF-8. If you think some text was stored in UTF-8, this validation will most certainly tell you if this assumption is right. But if the text was stored in an 8-bit encoding like Latin-1, a validation like this is pointless, because any sequence of bytes is valid. Anyway, I'd love to continue this discussion with you, but I fear it's driving us away from the OP's question. – lenz Apr 12 '20 at 20:25
  • @lenz - Storing utf8 bytes in a latin1 column is mostly OK. However, sorting and comparisons are broken because the individual bytes will be compared. This is called "double-encoding" in https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored – Rick James Apr 12 '20 at 20:31
  • @lenz - https://stackoverflow.com/questions/35061775/how-to-detect-chinese-character-in-mysql shows how to check for specifically Chinese. I think I provided a different REGEXP for someone looking for just Arabic. I can pretty easily conjure up similar code for any other language. I think I have a Perl script that will discover whether a file has non-utf8 8-bit codes. – Rick James Apr 12 '20 at 20:36
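
A small sketch of the validation point from the comments, using the bytes from the question as an example (the stored byte is presumably 0x92, as discussed above): strict UTF-8 decoding rejects the stray byte, while latin-1 accepts any byte sequence at all.

data = b'N\x92a pas'    # the column bytes as presumably stored under MySQL's latin1

try:
    data.decode('utf-8')            # strict UTF-8 validation rejects the stray 0x92
except UnicodeDecodeError:
    print("not valid UTF-8")

print(data.decode('latin-1'))       # always succeeds: every byte sequence is "valid" latin-1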

2 Answers


Converting the database to UTF-8 is better in the long run, but risky because, as you say, you may break other things. What you can do is change the database connection's encoding to UTF-8. That way you get UTF-8 strings out of the database without changing how the data is actually stored.

conn.setdecoding(pyodbc.SQL_CHAR, encoding='utf8')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf8')

If that seems too risky, you could consider having two separate database connections, the original one and a new one using utf8, and migrating the app to utf8 little by little as you have time to test.
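
A minimal sketch of that two-connection approach (the connection string is a placeholder; use whatever your get_database_connection() helper uses):

import pyodbc

CONN_STR = "DSN=mydb"   # placeholder connection string

# Legacy connection: decoded as before, keeps existing code working
conn_old = pyodbc.connect(CONN_STR)
conn_old.setdecoding(pyodbc.SQL_CHAR, encoding='latin1')
conn_old.setdecoding(pyodbc.SQL_WCHAR, encoding='latin1')

# New connection: decoded as UTF-8; move queries here as you get time to test them
conn_new = pyodbc.connect(CONN_STR)
conn_new.setdecoding(pyodbc.SQL_CHAR, encoding='utf8')
conn_new.setdecoding(pyodbc.SQL_WCHAR, encoding='utf8')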

If even that seems too risky, you could try using a character encoding that is closer to MySQL's version of latin1. MySQL's "latin1" is actually an extended version of the cp1252 encoding, which is itself a Microsoft extension of the "standard latin1" (ISO-8859-1) used in Python (among others).

conn.setdecoding(pyodbc.SQL_CHAR, encoding='cp1252')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='cp1252')
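
You can see the difference with the byte from the question (assuming the stored byte is 0x92, as discussed in the comments): cp1252 maps it to the typographic apostrophe, while Python's latin-1 leaves it as an unprintable control character.

print(b'N\x92a pas'.decode('cp1252'))    # N’a pas  (0x92 -> U+2019, the typographic apostrophe)
print(b'N\x92a pas'.decode('latin-1'))   # 0x92 stays an invisible C1 control character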
Joni

Don't use any form of encoding/decoding; it only complicates your code and hides more errors. In fact, you may be trying to "make two wrongs make a right".

Go with utf8 (or utf8mb4).

Notes on "question mark": Trouble with UTF-8 characters; what I see is not what I stored
Notes on Python: http://mysql.rjweb.org/doc.php/charcoll#python
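
For reference, a sketch of an all-UTF-8 pyodbc setup (the DSN is a placeholder and the CHARSET option is an assumption based on MySQL Connector/ODBC; adjust for your driver):

import pyodbc

conn = pyodbc.connect("DSN=mydb;CHARSET=utf8mb4")    # hypothetical DSN
conn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
conn.setencoding(encoding='utf-8')                   # for string parameters sent to the server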

Rick James