4

Django version 2.0. Python 3

My database charset and collation:

mysql> SELECT @@character_set_database, @@collation_database;
+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
| latin1                   | latin1_swedish_ci    |
+--------------------------+----------------------+

Old developer inserted data in KOI8-R encoding using Perl :(

To get correct values from database I used ugly construction str(username).encode('latin1').decode('koi8-r'). And what? I need to use it in all my project to send data to output? Or write function to encode context dictionary, but i also need additional to encode/decode all data. It will affect the usability and productivity

Without this i get something like ëÏÚÌÑÎËÏ òÏÍÁÎ éÏÓÉÆÏ×ÉÞ

How to globally set encoding in Django to prevent encode/decode operation in every place? I changed encoding different ways and nothing happens.

In settings.py I tried to set DEFAULT_CHARSET into different encodings (if I set default_charset to KOI8-R i get an error: UnicodeEncodeError: 'charmap' codec can't encode characters in position 6228-6235: character maps to . With other encodings no errors but no result). I tried to set in Database section of settings.py different values of charset and collation.

'OPTIONS': {
    'charset': 'latin1',
    'init_command': "SET sql_mode='STRICT_TRANS_TABLES', character_set_client=latin1, character_set_results=latin1, character_set_connection=latin1, collation_connection=latin1_swedish_ci",
}

I added <meta http-equiv="Content-type" content="text/html; charset=koi8-r (or other)" /> to <head> tag in index.html template. No result.

It seems that Django execute SET NAMES utf8 everytime

Why in Perl i can send header with charset=koi8-r and i get normal values from this tables in my browser using CGI? Why no similar result in Python with Django or Flask? Simple example in Perl

Serg
  • 109
  • 2
  • 10

3 Answers3

1

I think you're confusing network character encoding with storage encoding. In MySQL the life of string data is roughly this:

disk_storage --decode--> MySQL --encode--> network --decode--> database_driver

When string data is read from the disk, MySQL decodes it using the character_set_database value. When a client connects over the network, the client specifies an encoding for the connection. For Python this is typically UTF-8. MySQL then encodes the data to the connection encoding. The Python Mysql driver then decodes the data it receives using the connection encoding it set.

If anyone of those decodings or encodings use the wrong value, then bad data will be created. If character_set_database has been set incorrectly, then MySQL will be decoding the data incorrectly before encoding the bad data on the network connection.

The solution should be as simple as changing character_set_database to the correct value without changing the actual data.

This can be achieved with:

ALTER DATABASE dbname CHARACTER SET koi8r COLLATE koi8r_general_ci;

(DO NOT RUN ALTER TABLE tbl_name CONVERT.. - this will actually re-encode your data. As the old character_set value was wrong, your data will be decoded incorrectly before encoding to the new encoding)

Change all Python settings back to their default (UTF-8 etc). Do not set DEFAULT_CHARSET or any other value.

To ensure that MySQL driver connects correctly and uses UTF-8 for the network connection set use_unicode=True and charset="utf8"

E.g.

>>> db = MySQLdb.connect(host="localhost", user='root', passwd='passwd', db='sandbox', use_unicode=True, charset="utf8")
Alastair McCormack
  • 26,573
  • 8
  • 77
  • 100
  • Thanks for your reply. I tried to do this earlier - without effect. Changed charset and collation for databse and even tables to koi8r. Simple example of connection in MySQLdb: con = mdb.connect('localhost', 'user', 'passwd', 'database'); And i get þÅÐÒÑÇÏ×Á éÎÅÓÓÁ ÷ÉËÔÏÒÏ×ÎÁ on output. This is without Django! – Serg Jul 27 '18 at 17:33
  • So you ran `ALTER TABLE tbl_name CONVERT TO CHARACTER SET koi8r COLLATE koi8r_general_ci;`? – Alastair McCormack Jul 27 '18 at 17:44
  • If all by default mysql> ALTER TABLE user CONVERT TO CHARACTER SET koi8r COLLATE koi8r_general_ci; ERROR 1366 (HY000): Incorrect string value: '\xF0\xCF\xC7\xCF\xCE\xC9...' for co lumn 'fio' at row 1. If i doing alter table user character set koi8r collate koi8r_general_ci; before this - result is the same (incorrect string value) – Serg Jul 27 '18 at 18:02
  • `ALTER TABLE tbl_name CONVERT` converts the data! That's why you got an error - Mysql decoded the old data using `latin1` then encoded using `koi8r`. That will have borked your data! You need to run **just** the `ALTER DATABASE dbname CHARACTER SET koi8r COLLATE koi8r_general_ci;` command. – Alastair McCormack Jul 27 '18 at 18:07
  • Created a new database with charset koi8r and collation koi8r_general_ci. Imported tables from original dump. Tables in latin1, db in koi8r. Right? | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | koi8r | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | latin1 | | character_set_system | utf8 | – Serg Jul 27 '18 at 18:39
  • I added printing charset in connections.py of MySQLdb libbrary and it prints that connection is in latin1 charset, not koi8r. If I add charset='utf8' argument - charset becomes utf8. – Serg Jul 27 '18 at 18:47
  • 1
    No - you're over thinking it and not reading my answer. If you create a new database and import then you risk decoding and re-encoding incorrectly all over again. Restore all Python settings to default. Restore the database from backup as-is. Do not change anything except running `ALTER DATABASE dbname CHARACTER SET koi8r COLLATE koi8r_general_ci;` – Alastair McCormack Jul 27 '18 at 19:45
  • 1
    Correction: To ensure MySQL driver connects correctly, set `use_unicode=True` and `charset="utf8"` on your connection. E.g. `db = MySQLdb.connect(host="localhost", user='root', passwd='passwd', db='sandbox', use_unicode=True, charset="utf8")` – Alastair McCormack Jul 27 '18 at 19:52
  • Yes! On my work PC it works, Thanks! But anyway i can not change charset and collation on a running server with billing system. Billing system written in Perl and any changes of the database will break all. I don't know how Perl get correct data from this database without any encode/decode operations except meta tag with charset=koi8-r in html templates. And I can only modify MySQLdb source as in my answer below or everywere use encode/decode operation? May be You know how to add alias to encodings (python knows koi8r as koi8-r and koi8_r, i need to add koi8r alias)? – Serg Jul 27 '18 at 21:07
0

I solved the problem.

The first way

ALTER DATABASE dbname CHARACTER SET koi8r COLLATE koi8r_general_ci;

and use_unicode=True, charset="utf8" attributes in connection. But I can not change active database on the server, only on the test PC.

The second way

In my virtual environment i have edited lib/python3.6/site-packages/MySQLdb/connections.py:

In Connection class I have added:

self.force_koi8r = kwargs2.pop('force_koi8r', kwargs.get('force_koi8r', False))

and below edited part of code:

def set_character_set(self, charset):
    #if charset == "utf8mb4":
    #    py_charset = "utf8"
    #else:
    #    py_charset = charset

    # bugfix:
    if charset == "utf8mb4":
        py_charset = "utf8"
    elif charset == "latin1" and self.force_koi8r == True:
        py_charset = 'koi8-r'
    else:
        py_charset = charset

Now it works fine with force_koi8r=True argument, but that is not a good solution. This solution is only for this project

Problem was in the title of encoding. Python knows this encoding as 'koi8-r' or 'koi8_r'. But MySQL knows it as 'koi8r'. And if I set charset=koi8-r - mysql gives an error, if I set charset=koi8r - Python gives an error. (unknown encoding)

The third way

If anybody knows how to add an alias of encoding in Python (default: koi8-r, koi8_r and i need to add koi8r) please tell me

Serg
  • 109
  • 2
  • 10
0

I'm pretty sure you need the 2-step ALTER:

You have CHARACTER SET latin1, but have non-latin1 bytes. You need to leave bytes alone while fixing charset:

First, lets assume you have this declaration for tbl.col:

col VARCHAR(111) CHARACTER SET latin1 NOT NULL

To convert the column without changing the bytes:

ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET koi8r NOT NULL;

Note: If you start with TEXT, use BLOB as the intermediate definition. (Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)

-- http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases

As you will see in that link, there are many different fixes for different scenarios. If you have already applied some of the other Answers, you may have made things worse! If you need further help, please provide a sample with

SELECT col, HEX(col) FROM ... WHERE ...

I may be able to deduce where things stand. ëÏÚÌÑÎËÏ òÏÍÁÎ éÏÓÉÆÏ×ÉÞ and þÅÐÒÑÇÏ×Á éÎÅÓÓÁ ÷ÉËÔÏÒÏ×ÎÁ look thoroughly mangled; a quick glance says that maybe a wrong conversion was applied. What was the source Korean text?

Rick James
  • 135,179
  • 13
  • 127
  • 222