
I'm using SQLAlchemy and MySQL, with a files table to store files. That table is defined as follows:

mysql> show full columns in files;
+---------+--------------+-----------------+------+-----+---------+-------+---------------------------------+---------+
| Field   | Type         | Collation       | Null | Key | Default | Extra | Privileges                      | Comment |
+---------+--------------+-----------------+------+-----+---------+-------+---------------------------------+---------+
| id      | varchar(32)  | utf8_general_ci | NO   | PRI | NULL    |       | select,insert,update,references |         |
| created | datetime     | NULL            | YES  |     | NULL    |       | select,insert,update,references |         |
| updated | datetime     | NULL            | YES  |     | NULL    |       | select,insert,update,references |         |
| content | mediumblob   | NULL            | YES  |     | NULL    |       | select,insert,update,references |         |
| name    | varchar(500) | utf8_general_ci | YES  |     | NULL    |       | select,insert,update,references |         |
+---------+--------------+-----------------+------+-----+---------+-------+---------------------------------+---------+

The content column of type MEDIUMBLOB is where the files are stored. In SQLAlchemy that column is declared as:

__maxsize__ = 12582912  # 12 MiB
content = Column(LargeBinary(length=__maxsize__))
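
For context, the full model is declared roughly as follows (a sketch: the class name and the remaining column declarations are inferred from the schema above):

from sqlalchemy import Column, DateTime, LargeBinary, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class File(Base):
    __tablename__ = 'files'
    __maxsize__ = 12582912  # 12 MiB

    id = Column(String(32), primary_key=True)
    created = Column(DateTime)
    updated = Column(DateTime)
    content = Column(LargeBinary(length=__maxsize__))
    name = Column(String(500))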

I am not quite sure about the difference between SQLAlchemy's BINARY and LargeBinary types, or between MySQL's VARBINARY and BLOB types, and I am not sure whether that matters here.
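As a sanity check, one can ask SQLAlchemy what DDL the LargeBinary declaration produces for MySQL (a sketch; if I understand the docs correctly, MySQL then creates the smallest BLOB type that can hold that length, which is why the column above shows up as MEDIUMBLOB):

from sqlalchemy import LargeBinary
from sqlalchemy.dialects import mysql

# Compile the column type against the MySQL dialect to see the emitted DDL.
print(LargeBinary(length=12582912).compile(dialect=mysql.dialect()))
# Should print BLOB(12582912); since 12 MiB exceeds 64 KiB, MySQL
# creates the column as MEDIUMBLOB.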

Question: Whenever I store an actual binary file in that table, i.e. a Python bytes object (a b'...' literal), I get the following warning

.../python3.4/site-packages/sqlalchemy/engine/default.py:451: Warning: Invalid utf8 character string: 'BCB121'
  cursor.execute(statement, parameters)

I don't want to just ignore the warning, but the files seem to be intact. How do I handle this warning gracefully, and how can I fix its cause?
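
For completeness, the code that triggers the warning is essentially the following (a sketch: the file name is a placeholder, the connect string is the one I actually use, and File is the model sketched above):

import uuid
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('mysql://user@localhost/database?charset=utf8')
session = sessionmaker(bind=engine)()

with open('example.bin', 'rb') as fh:
    data = fh.read()  # a bytes object

# A hex UUID fits the varchar(32) primary key.
session.add(File(id=uuid.uuid4().hex, content=data, name='example.bin'))
session.commit()  # this commit emits the "Invalid utf8 character string" warning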

Side note: This question seems to be related, and it may be a MySQL bug that the server tries to convert all incoming data to UTF-8 (see [this answer](http://stackoverflow.com/questions/14734812/is-a-blob-converted-using-the-current-default-charset-in-mysql#14745685)).

– Jens
  • You seem to be running Python. The two "Side notes" refer to PHP and Perl issues. Something in _Python_ is failing to honor your request to use "blob" data. – Rick James Jan 06 '16 at 23:34
  • @RickJames: Yes, all Python. The "[this answer](http://stackoverflow.com/questions/14734812/is-a-blob-converted-using-the-current-default-charset-in-mysql#14745685)" link in the Side note, however, seems to indicate this to be a MySQL issue. If this is a Python issue though, I'd still like to understand what I'm missing here... – Jens Jan 07 '16 at 07:55
  • If the _client_ (PHP, Python, etc) treats the string as "characters", then this problem can arise. If it treats it as arbitrary "bytes", then the problem does not occur. Storing into a MySQL `BLOB` does no utf8 checking; storing into a `TEXT` does. – Rick James Jan 07 '16 at 16:59
  • So, I claim, some detail _in the client_ is missing. (Or Alchemy fails to handle `BLOBs`.) – Rick James Jan 07 '16 at 17:00
  • @RickJames: it's a `b''` object, i.e. binary bytes... – Jens Jan 07 '16 at 18:13
  • Which MySQL driver are you using? – eggyal Jan 08 '16 at 18:37

2 Answers


Turns out that this was a driver issue. Apparently the default MySQL driver stumbles over Python 3 and utf8 support. Installing cymysql into the virtual Python environment resolved the problem, and the warnings disappeared.

The fix: find out whether MySQL connects through a socket or a port (see here), and then modify the connection string accordingly. In my case, using a socket connection:

mysql+cymysql://user:pwd@localhost/database?unix_socket=/var/run/mysqld/mysqld.sock

Use the port argument otherwise.
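
In code, the engine is then created like so (user, password, and database name are placeholders):

from sqlalchemy import create_engine

# Connect through the CyMySQL driver over the MySQL socket; when MySQL
# listens on TCP instead, drop unix_socket and pass host:port in the URL.
engine = create_engine(
    'mysql+cymysql://user:pwd@localhost/database'
    '?unix_socket=/var/run/mysqld/mysqld.sock'
)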

Edit: While the above fixed the encoding issue, it gave rise to another one: blob size. Due to a bug in CyMySQL, blobs larger than 8M fail to commit. Switching to PyMySQL fixed that problem, although it seems to have a similar issue with large blobs.

– Jens
  • Actually, it is apparently the MySQL server that is generating this warning, and the way around it is to use `_binary` to say that the thing you're inserting should not be interpreted. See https://bugs.mysql.com/bug.php?id=79317 – Wodin Apr 26 '16 at 10:55
  • @Wodin: which means adding `binary_prefix=true` to the connect string, as in the sketch below; see also [this discussion](https://groups.google.com/forum/#!topic/sqlalchemy/am2cd2Cn41M). – Jens Apr 30 '18 at 09:31
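
For reference, such a connect string might look like this (a sketch with the PyMySQL driver and placeholder credentials):

from sqlalchemy import create_engine

# binary_prefix=true makes PyMySQL send bytes parameters as _binary'...'
# literals, so the server does not try to utf8-validate blob data.
engine = create_engine(
    'mysql+pymysql://user:pwd@localhost/database?charset=utf8&binary_prefix=true'
)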

Not sure, but your problem might have the same roots as the one I had several years ago in Python 2.7: https://stackoverflow.com/a/9535736/68998. In short, MySQL's interface does not let you be certain whether you are working with a true binary string or with text in a binary collation (used because of the lack of a case-sensitive utf8 collation). Therefore, a MySQL binding has the following options:

  • return all string fields as binary strings, and leave the decoding to you
  • decode only the fields that do not have a binary flag (so much fun when some of the fields are unicode and others are str)
  • have an option to force decoding to unicode for all string fields, even true binary

My guess is that in your case, the third option is enabled somewhere in the underlying MySQL binding. The first suspect is your connection string (connection params).
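
A quick way to see what the binding actually returns for the blob column (a sketch, assuming the PyMySQL driver and the credentials/table from the question):

import pymysql

conn = pymysql.connect(host='localhost', user='user', password='pwd',
                       db='database', charset='utf8')
try:
    with conn.cursor() as cur:
        cur.execute('SELECT content FROM files LIMIT 1')
        (content,) = cur.fetchone()
        # bytes: the driver leaves blobs alone; str: it decodes them to text.
        print(type(content))
finally:
    conn.close()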

– newtover
  • My current connection string is: `mysql://user@localhost/database?charset=utf8`. According to the [documentation](https://dev.mysql.com/doc/connector-net/en/connector-net-connection-options.html) the `charset` option “Specifies the character set that should be used to encode all queries sent to the server.” Hmm: encode *all* queries? – Jens Jan 10 '16 at 01:09