My setup is Python 3, Django 3.2, and MySQL 5.7 on an AWS Amazon Linux instance. When I originally created my database and tables, I did not specify a charset or encoding. After reading the following post, I determined that my tables and columns are currently latin1: How do I see what character set a MySQL database / table / column is?
I have also read this post to understand the difference between the encoding the client uses and the one the table/database uses. As I understand it, this difference is what allows the client to save non-latin1 chars in a MySQL table with a latin1 charset: MySQL 'set names latin1' seems to cause data to be stored as utf8
Here is some code to show what I am trying to do:
# make a new object
mydata = Dataset()
# set the description. This has a few different non-latin1 characters:
# smart quotes, long dash, dots over the i
mydata.description = "“naïve—T-cells”"
# this returns an error to prove to myself that there are non-latin1 chars in the string
mydata.description.encode("latin-1")
# Traceback (most recent call last):
# File "<console>", line 1, in <module>
# UnicodeEncodeError: 'latin-1' codec can't encode character '\u201c' in position 0:
# ordinal not in range(256)
# this works though (i.e. this string can be encoded using cp1252)
mydata.description.encode("cp1252")
# >>> b'\x93na\xefve\x97T-cells\x94'
# It is also fine to save it to the MySQL table (which has the latin1 charset; I
# believe this works because the client can handle non-latin1 chars, per the link above)
# no error for this:
mydata.save()
# now I try again with a different non-latin1 character (the greater-than-or-equal sign)
mydata.description = "≥4"
# both of these raise an error as expected, since the ≥ character isn't in either charset
mydata.description.encode("latin-1")
mydata.description.encode("cp1252")
# I can't save this non-latin1 char to the database:
mydata.save()
# django.db.utils.OperationalError: (1366, "Incorrect string value: '\\xE2\\x89\\xA54' for column 'description' at row 1")
My question is: why do some non-latin1 chars get saved without a problem, but other non-latin1 chars cause an "OperationalError Incorrect string value" when I try to insert them?
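For reference, the bytes quoted in the error message can be inspected in plain Python. This is only a small sketch using the standard library; it suggests (but does not prove) that the value reached the server as UTF-8 and was rejected when MySQL tried to convert it to the column's charset:

```python
# Bytes copied from the MySQL error message above.
rejected = b"\xE2\x89\xA54"

# They decode cleanly as UTF-8 and give back the original string.
decoded = rejected.decode("utf-8")
print(decoded)

# Encoding the original string as UTF-8 produces the same bytes.
print("≥4".encode("utf-8") == rejected)
```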
I could probably solve the problem by changing the charset on the MySQL tables (Django charset and encoding), but my app is deployed with several different customers, so that would be a challenge (understatement). Instead, I would like to add a step to the data loading process that checks for invalid characters up front, rather than throwing an error at save time, so that the user can fix the document before loading.
So, my practical question is: how do I know which non-latin1 characters will cause a problem and which are ok? Are all cp1252 characters allowed to be saved but anything beyond cp1252 not allowed?
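A minimal sketch of the kind of pre-load check I have in mind. It assumes that the characters MySQL accepts for a latin1 column are the ones Python's `cp1252` codec can encode, which matches the experiment above but is exactly the assumption I am asking about. The function name `find_unsupported_chars` is hypothetical:

```python
def find_unsupported_chars(text, encoding="cp1252"):
    """Return (index, character) pairs that `encoding` cannot represent.

    Assumption: `encoding` approximates what the MySQL column accepts.
    """
    problems = []
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            problems.append((index, char))
    return problems


# The first string from the example above passes; the second is flagged.
print(find_unsupported_chars("“naïve—T-cells”"))  # []
print(find_unsupported_chars("≥4"))               # [(0, '≥')]
```

Returning positions along with the characters would let the loader tell the user where in the document each problem character sits.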
How can I check what encoding my Django client is using? (I don't have anything related to charset or SET NAMES in the OPTIONS of my DATABASES setting in settings.py.)
Note: I don't want to alter the tables or require a migration. I want to prevent the errors by informing the users about bad chars.
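One way this might be inspected is to ask the server for its `character_set_*` session variables over the same connection Django uses. The sketch below is an assumption about how to go about it, not a confirmed answer; the helper name `session_charsets` is hypothetical, and it accepts any DB-API cursor so it does not depend on Django itself:

```python
def session_charsets(cursor):
    """Return MySQL's character_set_* variables for this session as a dict.

    `cursor` is any DB-API cursor connected to MySQL. No query parameters
    are passed, so the % wildcard is sent to the server unchanged.
    """
    cursor.execute("SHOW VARIABLES LIKE 'character_set%'")
    # Each row is a (Variable_name, Value) pair.
    return dict(cursor.fetchall())


# Hypothetical usage from `python manage.py shell`:
#   from django.db import connection
#   with connection.cursor() as cursor:
#       print(session_charsets(cursor))
```

The entries of interest would be `character_set_client` and `character_set_connection`, which describe what the server believes the client is sending.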