Possible Duplicate:
How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?
Background:
I am using Django with MySQL 5.1 and I am having trouble with 4-byte UTF-8 characters causing fatal errors throughout my web application.
I've used a script to convert all tables and columns in my database to UTF-8, which has fixed most Unicode issues, but 4-byte Unicode characters still cause problems. As noted elsewhere, MySQL 5.1 does not support UTF-8 characters over 3 bytes in length.
Whenever I enter a 4-byte Unicode character (e.g. 🀐, U+1F010) into a ModelForm on my Django website, the form validates and then an exception similar to the following is raised:
Incorrect string value: '\xF0\x9F\x80\x90' for column 'first_name' at row 1
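To make the failure concrete, the bytes quoted in that error message can be decoded to see which character MySQL is rejecting. This is a small illustrative check written for Python 3 (not the Python 2 used elsewhere in this question); the variable names are made up for the example.

```python
# The four bytes quoted in the MySQL error message above.
raw = b"\xF0\x9F\x80\x90"

char = raw.decode("utf-8")   # decodes to a single code point
code_point = ord(char)

print("U+%04X" % code_point)        # U+1F010
print(len(char.encode("utf-8")))    # 4
print(code_point > 0xFFFF)          # True
```

Any code point above U+FFFF lies outside the Basic Multilingual Plane and always takes 4 bytes in UTF-8, which is exactly the class of characters that triggers the error.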
My question:
What is a reasonable way to avoid fatal errors caused by 4-byte UTF-8 characters in a Django web application with a MySQL 5.1 database?
I have considered:
- Selectively disabling MySQL warnings to avoid specifically that error message (not sure whether that is possible yet)
- Creating middleware that will look through the request.POST QueryDict and substitute/remove all invalid UTF-8 characters
- Somehow hook/alter/monkey patch the mechanism that outputs SQL queries for Django or for MySQLdb to substitute/remove all invalid UTF-8 characters before the query is executed
Example middleware for replacing invalid characters (inspired by this SO question):
import re


class MySQLUnicodeFixingMiddleware(object):
    # Matches surrogates and anything above U+FFFF (4 bytes in UTF-8).
    INVALID_UTF8_RE = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

    def process_request(self, request):
        """Replace 4-byte unicode characters by REPLACEMENT CHARACTER"""
        # request.POST is immutable; copy() returns a mutable QueryDict.
        request.POST = request.POST.copy()
        for key, values in request.POST.iterlists():
            request.POST.setlist(key,
                [self.INVALID_UTF8_RE.sub(u'\uFFFD', v) for v in values])
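For experimenting with the substitution outside of Django, the same idea can be sketched as a standalone function. This version assumes Python 3, where a string is a sequence of code points, so the pattern can target the range above U+FFFF directly. The names NON_BMP_RE and replace_non_bmp are hypothetical and not part of Django or MySQLdb.

```python
import re

# Any code point above U+FFFF, i.e. every character UTF-8 encodes in 4 bytes.
NON_BMP_RE = re.compile("[\U00010000-\U0010FFFF]")


def replace_non_bmp(text, replacement="\uFFFD"):
    """Return text with each non-BMP character replaced.

    Pass replacement="" to remove the characters instead of substituting them.
    """
    return NON_BMP_RE.sub(replacement, text)


sample = "Ann\U0001F010"
print(repr(replace_non_bmp(sample)))      # 'Ann\ufffd'
print(repr(replace_non_bmp(sample, "")))  # 'Ann'
```

Unlike the middleware's pattern, this sketch does not touch lone surrogates; it only covers the 4-byte case described in the question. Characters that fit in 3 bytes or fewer, such as accented letters or the euro sign, pass through unchanged.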