Possible Duplicate:
How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?

Background:

I am using Django with MySQL 5.1 and I am having trouble with 4-byte UTF-8 characters causing fatal errors throughout my web application.

I've used a script to convert all tables and columns in my database to UTF-8, which has fixed most Unicode issues, but 4-byte Unicode characters still cause problems. As noted elsewhere, the utf8 character set in MySQL 5.1 does not support UTF-8 characters longer than 3 bytes.

Whenever I enter a 4-byte Unicode character (e.g. an emoji or any other character outside the Basic Multilingual Plane) into a ModelForm on my Django website, the form validates and then an exception similar to the following is raised:

Incorrect string value: '\xF0\x9F\x80\x90' for column 'first_name' at row 1
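To see why this value is rejected, the bytes from the error message can be decoded by hand. The snippet below is a small illustration in Python 3 (not the Python 2 of the original post); the variable names are arbitrary.

```python
# The byte sequence quoted in the MySQL error message.
raw = b'\xF0\x9F\x80\x90'

# Decoding it yields a single character whose code point lies above U+FFFF,
# i.e. outside the Basic Multilingual Plane.
ch = raw.decode('utf-8')
print(len(ch), hex(ord(ch)))

# Characters up to U+FFFF need at most 3 bytes in UTF-8; anything above needs 4.
for sample in ('a', '\u00e9', '\u20ac', ch):
    print(hex(ord(sample)), len(sample.encode('utf-8')))
```

Any code point above U+FFFF encodes to 4 bytes, which is exactly the range the 3-byte utf8 character set cannot store.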

My question:

What is a reasonable way to avoid fatal errors caused by 4-byte UTF-8 characters in a Django web application with a MySQL 5.1 database?

I have considered:

  1. Selectively disabling MySQL warnings to suppress that specific error message (not sure yet whether that is possible)
  2. Creating middleware that looks through the request.POST QueryDict and substitutes or removes all invalid UTF-8 characters
  3. Hooking into, altering, or monkey-patching the mechanism that outputs SQL queries for Django or for MySQLdb, so that all invalid UTF-8 characters are substituted or removed before the query is executed

Example middleware for replacing invalid characters (inspired by this SO question):

import re

class MySQLUnicodeFixingMiddleware(object):

    INVALID_UTF8_RE = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

    def process_request(self, request):
        """Replace 4-byte unicode characters by REPLACEMENT CHARACTER"""
        request.POST = request.POST.copy()
        for key, values in request.POST.iterlists():
            request.POST.setlist(key,
                [self.INVALID_UTF8_RE.sub(u'\uFFFD', v) for v in values])
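For readers on a newer stack, here is a rough sketch of the same idea as a callable-style middleware in Python 3. It assumes a Django version that supports middleware taking `get_response`; the class name `ReplaceNonBMPMiddleware` and the constant `NON_BMP_RE` are made up for illustration. Unlike the regex above, this pattern matches only code points above U+FFFF and does not touch lone surrogates.

```python
import re

# Matches any code point above U+FFFF (these need 4 bytes in UTF-8).
NON_BMP_RE = re.compile('[\U00010000-\U0010FFFF]')


class ReplaceNonBMPMiddleware:
    """Hypothetical middleware: swap 4-byte characters in POST data for U+FFFD."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        if request.method == 'POST':
            # request.POST is immutable; copy() gives a mutable QueryDict.
            post = request.POST.copy()
            # Snapshot the items so the mapping is not mutated while iterating.
            for key, values in list(post.lists()):
                post.setlist(key, [NON_BMP_RE.sub('\uFFFD', v) for v in values])
            request.POST = post
        return self.get_response(request)
```

This only covers form-encoded POST data. Values arriving through the query string, JSON bodies, or file names would need separate handling.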
Trey Hunner

1 Answer

Do you have the option to upgrade MySQL? If you do, you can upgrade (utf8mb4 is available from MySQL 5.5.3 onward) and set the encoding to utf8mb4.
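If upgrading is possible, the Django side of the change might look roughly like the settings fragment below. The database name, user, and password are placeholders, not values from the question. The existing tables and columns would also have to be converted to utf8mb4 on the MySQL side, for example with `ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4`.

```python
# settings.py (fragment). NAME, USER and PASSWORD are hypothetical placeholders.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'mydb',
        'USER': 'myuser',
        'PASSWORD': 'change-me',
        'OPTIONS': {
            # Ask the MySQL client library to talk to the server in utf8mb4,
            # so 4-byte characters survive the connection.
            'charset': 'utf8mb4',
        },
    }
}
```

Keep in mind that utf8mb4 columns use up to 4 bytes per character, so index length limits on long VARCHAR columns may need checking after the conversion.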

Assuming that you don't have the option, I see these options for you:

1) Add JavaScript / frontend validation to prevent entry of anything other than 1-, 2-, or 3-byte Unicode characters,

2) Supplement that with a cleanup function in your models to strip any 4-byte Unicode characters from the data (which would be your option 2 or 3)
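The model-level cleanup in option 2 could be sketched as follows. This is only an illustration in Python 3: the helper names `strip_non_bmp` and `strip_non_bmp_fields` are invented here, and the code works on plain attribute access so it does not depend on Django itself.

```python
import re

# Matches any code point above U+FFFF (these need 4 bytes in UTF-8).
_NON_BMP_RE = re.compile('[\U00010000-\U0010FFFF]')


def strip_non_bmp(text):
    """Return text with every 4-byte (non-BMP) character removed."""
    return _NON_BMP_RE.sub('', text)


def strip_non_bmp_fields(obj, field_names):
    """Clean the named string attributes of obj in place and return obj.

    Missing attributes and non-string values are left untouched.
    """
    for name in field_names:
        value = getattr(obj, name, None)
        if isinstance(value, str):
            setattr(obj, name, strip_non_bmp(value))
    return obj


# Possible use inside a Django model (assumed wiring, shown as a comment):
#
# class Person(models.Model):
#     def save(self, *args, **kwargs):
#         strip_non_bmp_fields(self, ['first_name'])
#         super().save(*args, **kwargs)
```

Hooking into `save()` does not cover bulk operations such as `QuerySet.update()`, which bypass it, so this is best treated as a safety net behind the form or middleware checks.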

At the same time, it does look like your users are in fact entering 4-byte characters. If there is a business case for supporting them in your application, you could go to the powers that be and request an upgrade.

alok