1

I have a table in MySQL (5.7) database, which has collation utf8_unicode_ci, and where I'm inserting some data with Python (3.6).

With some of the strings (for example, '\xCE\xA6') I get "Incorrect string value" error. On the DB side, I can mute this error by turning off the strict mode in MySQL, or changing the field's collation to utf8mb4.

However, such strings are "anomalies", and it is not desirable to change a collation or the sql_mode.

How can I detect in Python 3, that a given string will lead to "incorrect string value" error with MySQL, before inserting into a Table ?

Hayk
  • 627
  • 2
  • 9
  • 19
  • Try [`decode()`](https://stackoverflow.com/a/1168099/6248528) – Solarflare Jun 03 '18 at 14:34
  • @Solarflare, thanks. String doesn't have decode attribute, however, even if one transforms the given string ('\xCE\xA6' in this case) into a byte array and decodes into utf8, it just works in Python. How does one deduce from there, that MySQL will through an error ? – Hayk Jun 03 '18 at 15:36
  • Yes, you need to convert it to a bytearray first. I expected your sequence to be non-unicode (and then python would complain about it and could be used to validate), but it seems to be. So I just tested your 2 bytes on mysql and it works for me; so your problem might be somewhere else (it's oftentimes a connection setting). If you are asking if a specific byte sequence will work in your specific setup (and not if if a specific byte sequence is valid unicode and *should* work), the only way to be 100% sure is to let MySQL verify it, e.g. insert it into a temporary table and catch an error. – Solarflare Jun 03 '18 at 17:26
  • @Solarflare, the connection's charset is set to utf8, so that is not what causes the problem, I think. I read somewhere that utf8 for Python does not coincide with the utf8_unicode_ci collation implemented in MySQL. The problem is that I don't know how to spot the difference. But thanks for your reply, I'll try to dig more. – Hayk Jun 03 '18 at 17:47

1 Answers1

2

Where do you get the error message? What operation is being performed?

C3A6 is the UTF-8 (cf MySQL's utf8 or utf8mb4) hex for æ; does it seem likely that that was the desired character?

To handle utf8 (or utf8mb4), you need to determine what the client's encoding. Sounds like UTF-8. So, when connecting to MySQL, tell it that -- use these in the connect call:

charset="utf8", use_unicode=True

If the character is in the python source, you need

# -*- coding: utf-8 -*-

at the beginning of the source.

Also the column you are inserting into needs to be CHARACTER SET utf8 (or utf8mb4).

utf8mb4 is needed for Emoji and some of Chinese; otherwise it is 'equivalent' to utf8.

Do not use decode() or any other conversion functions; that will just make things harder to fix. In this arena, two wrongs does not make a right; it makes a worse mess.

If you have other symptoms of garbled characters, see Trouble with UTF-8 characters; what I see is not what I stored

To discuss further, please provide the connection call, the SQL statement involved, SHOW CREATE TABLE, and anything else involved.

C3A6 is a valid utf8/utf8mb4 character æ, and could be interpreted as valid, though unlikely, latin1 æ. But it is invalid for CHARACTER SET ascii. (I don't know how the error message occurred unless the connection said ascii or some obscure charset.)

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • thanks for your comments (+1). I could fix the error after all, and that was a silly mistake on my side. I wrote a Python script to populate my database with some test data (randomly building from some real data). So, it modifies over 40 Tables at once (lot of things are connected), and a certain field in one of the tables had latin collation instead of utf8, this was causing the error. – Hayk Jun 04 '18 at 06:24
  • I'm still wondering though, given this https://stackoverflow.com/questions/10957238/incorrect-string-value-when-trying-to-insert-utf-8-into-mysql-via-jdbc question, if the utf8 is understood (slightly) differently in MySQL than it is in Python, for example (of perhaps other languages as well). The answer in the link is from 2012 , is it still so in current versions of MySQL ? – Hayk Jun 04 '18 at 06:27
  • @Hayk - latin1 vs utf8: The table does not matter; the connection does. MySQL is willing to change the encoding on the fly _if_ the connection parameters correctly say what the encoding is in the _client_. – Rick James Jun 04 '18 at 15:46
  • @Hayk - \xF0\x90\x8D\x83 is in `utf8mb4`, but not in `utf8`; the leading F0 is the tip-off. There is no valid `utf8` (3-byte max) encoding starting with F0. (I added a new last paragraph.) – Rick James Jun 04 '18 at 15:51
  • thanks for the comments, but I'm not sure I am following you on ``...The table does not matter; the connection does...". Say, if I have a connection from Python to MySQL (by MySQLDB module, for example), with charset = utf8, then I should get an error if I try to insert a utf8 but non latin1 character into a latin1 collation field, shouldn't I ? – Hayk Jun 04 '18 at 17:25
  • `utf8` is not sufficient for any `F0...` characters; you need `utf8mb4` in `connect(...charset=...)` and in `CREATE TABLE ... CHARACTER SET ...`. – Rick James Jun 04 '18 at 21:35