
We've recently started getting `'utf8' codec can't decode bytes in position 30-31: unexpected end of data` when fetching data from the database (using Python and SQLAlchemy). We've traced the error to some special Japanese characters.

All tables have CHARSET=utf8, and these are the settings on the server when I run `SHOW VARIABLES;`:

| character_set_client                    | latin1
| character_set_connection                | latin1
| character_set_database                  | latin1
| character_set_filesystem                | binary
| character_set_results                   | latin1
| character_set_server                    | latin1
| character_set_system                    | utf8
| character_sets_dir                      | /usr/share/mysql/charsets/
| collation_connection                    | latin1_swedish_ci
| collation_database                      | latin1_swedish_ci
| collation_server                        | latin1_swedish_ci

If we want to move our environment to utf8, which settings are recommended, and how should we export and import our current data so it works with the new settings?

I've read some posts about exporting the data as latin1 by adding --default-character-set=latin1 to the mysqldump command and then importing it into a database that has the new settings, but since our original tables are already in utf8, this won't work.

I tried setting the connection charset after reading this thread: SQLAlchemy and UnicodeDecodeError

This stops the application from crashing, but all the old data comes back damaged.

  • You actually have latin1 data. You need to fetch it, encode it to UTF-8 and then reimport it into a UTF-8 collation table. I don't think you can simply change the encoding for the db and tables. Make sure to set the connection to UTF-8 too, in config or using `SET NAMES`. Can you make a backup and try just converting? – Daniel W. Dec 02 '13 at 13:03
  • Tried setting the connection, it solves the crashing application but damages the old data. I've updated the question. – Carl Dec 02 '13 at 13:10
  • So if the tables have utf8 encoding and the connection is latin1, the actual data is still stored as latin1? @DanFromGermany – Carl Dec 02 '13 at 13:11
  • It's generally a bad idea to mix encodings. It's like comparing EXE and PDF files - you can't. They have different bytes, different chars have different byte lengths, different headers (like a BOM). When you have a UTF-8 table and fetch data with PHP over a latin1 connection, you'll get utf-8 data in a latin1 variable and the data will get damaged. – Daniel W. Dec 02 '13 at 13:15
  • You might also encounter double encodings. The best resource for you in this case is the performance blog, the authors are for example MariaDB, Percona, InnoDB core developers: http://www.mysqlperformanceblog.com/2013/10/16/utf8-data-on-latin1-tables-converting-to-utf8-without-downtime-or-double-encoding/ – Daniel W. Dec 02 '13 at 13:22
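The damage mechanism described in the comments can be reproduced in plain Python. This is only a sketch of the byte-level behavior: the string stands in for what a latin1 connection hands back from a utf8 column, not for any actual driver API.

```python
# UTF-8 bytes for a Japanese string, as stored in the utf8 table.
original = "日本語"
stored_bytes = original.encode("utf-8")

# A latin1 connection hands those raw bytes back decoded as latin1,
# producing unreadable mojibake in the application.
mojibake = stored_bytes.decode("latin-1")

# Because latin-1 maps every possible byte value, the damage is
# reversible: re-encode as latin-1 to recover the raw bytes, then
# decode them as the UTF-8 they really are.
repaired = mojibake.encode("latin-1").decode("utf-8")
assert repaired == original
```

The same round trip is what the linked blog post performs at the SQL level when undoing double encoding; the Python version just makes it easy to verify on a sample value before touching the database.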

1 Answer


You can convert the character set and collation of a table with a single query:

ALTER TABLE <yourtable> CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
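For the server-side settings the question asks about, the connection and server defaults also need to move off latin1. A sketch of the relevant my.cnf options (option names assumed for MySQL 5.x; verify against your server version):

```ini
[mysqld]
character-set-server = utf8
collation-server     = utf8_unicode_ci

[client]
default-character-set = utf8
```

Clients that cannot read my.cnf can achieve the same per connection with `SET NAMES utf8;`, which sets character_set_client, character_set_connection and character_set_results in one statement.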
Eshant Sahu