1

My setup is using python3+, django 3.2 with mysql 5.7 on an AWS Amazon linux instance. When I originally created my database and tables, I did not specify a particular charset/encoding. So, I read the following post and determined that my tables and columns are currently latin1: How do I see what character set a MySQL database / table / column is?

I have also read this post to try and understand the differences between what the client uses as encoding and what the table/database is using -- this allows the client to save non-latin1 chars in a mysql table with latin1 charset: MySQL 'set names latin1' seems to cause data to be stored as utf8

Here is some code to show what I am trying to do:

# make a new object
mydata = Dataset()
# set the description. This has a few different non-latin1 characters:
#    smart quotes, long dash, dots over the i
mydata.description = "“naïve—T-cells”"

# this returns an error to prove to myself that there are non-latin1 chars in the string
mydata.description.encode("latin-1")
# Traceback (most recent call last):
#  File "<console>", line 1, in <module>
# UnicodeEncodeError: 'latin-1' codec cant encode character '\u201c' in position 0: 
#      ordinal not in range(256)

# this works though (ie this string can be encoded using cp1252)
mydata.description.encode("cp1252")
# >>>    b'\x93na\xefve\x97T-cells\x94'

# And, it is fine to save it to the mysql table (which has latin1 charset, but I 
# believe this works since the client can handle non-latin1 as I read from above link)
# no error for this:
mydata.save()

# now I try again but with a different non-latin1 character (greater than or equal sign)
mydata.description = "≥4"

# both of these give an error as expected, since the >= character isnt in either charset
mydata.description.encode("latin-1")
mydata.description.encode("cp1252")

# I cant save this non-latin1 char to the database:
mydata.save()
# django.db.utils.OperationalError: (1366, "Incorrect string value: '\\xE2\\x89\\xA54' for column 'description' at row 1")

My question is: why do some non-latin1 chars get saved without a problem, but other non-latin1 chars cause an "OperationalError Incorrect string value" when I try to insert them?

I could probably solve the problem by changing the charset on the mysql tables (Django charset and encoding), but I have my app deployed with several different customers and so this is kind of a challenge (understatement). Instead, I would like to create a step in the data loading process which checks for invalid characters rather than throwing an error so that the user can make the change to the document before loading.

So, my practical question is: how do I know which non-latin1 characters will cause a problem and which are ok? Are all cp1252 characters allowed to be saved but anything beyond cp1252 not allowed?

How can I check what encoding my django client is using? (I don't have anything related to charset or set names in my DATABASE Options in settings.py)

Note: I don't want anything to alter the tables or require a migration. I want to prevent the errors by informing the users about bad chars.

brfox
  • 323
  • 3
  • 14
  • I just recently converted all my tables to utf8mb4_0900_ai_ci to support emojis in production, wasn't really hard "ALTER TABLE my_table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;" – Jimmy Pells Mar 27 '22 at 00:31

1 Answers1

1

Go to the "mysql" commandline tool. Use it to do SHOW CREATE TABLE tablename; That will tell you the charsets (and collations) for the columns of that table.

SET NAMES latin1; declares that the client encoding is latin1, not cp1252, not UTF-8, etc.

\x93na\xefve\x97T-cells\x94 is the cp1256 or latin1 for “naïve—T-cells”. Hence, the SET should have helped.

latin1 hex:       936E61EF766597542D63656C6C7394
utf8 hex:         E2809C6E61C3AF7665E28094542D63656C6C73E2809D
'double-encoded': C3A2E282ACC5936E61C383C2AF7665C3A2E282ACE2809D542D63656C6C73C3A2E282ACC29D

(My answer in the link was referring to "double encoding" in item 7.)

E289A5 is utf8 for `≥`, which _cannot_ be properly encoded in latin1.

So, if you are seeing in the client, then it is not latin1, and some of the things in your Question need further investigation. Here are then encodings where it will work.

                    binary, utf8mb4, utf8  E289A5
                                    euckr  A1C3
                     gb18030, gb2312, gbk  A1DD
                                  keybcs2  F2
                             koi8r, koi8u  99
                          macce, macroman  B3

The bottom line, is that you should use UTF-8 (MySQL's "utf8mb4") for everything.

Rick James
  • 135,179
  • 13
  • 127
  • 222