In Python, how to replace all non-UTF-8 characters in a string?

Question

Update: The real problem is that MySQL utf8 does not support four-byte UTF-8 characters.

There are several questions on this topic, but none of them seems to be my question exactly, except for maybe this one, where the accepted answer does not work for me.

I am coding in Python with the MySQLdb module, and I want to put some text into a MySQL database. The database is configured for UTF-8, but the text occasionally contains ~~non-UTF-8~~ four-byte-UTF-8 characters.

The Python code for the database modification looks like this:

connection = MySQLdb.connect(
    'localhost',
    'root',
    '',
    'mydatabase',
    charset='utf8',
    use_unicode=True)
cursor = connection.cursor()
cursor.execute(
    'update mytable set entryContent=%s where entryName=%s',
    (entryContent, entryName))
connection.commit()

And it currently produces this warning:

./myapp.py:233: Warning: Invalid utf8 character string: 'F09286'
  (entry, word))
./myapp.py:233: Warning: Incorrect string value: '\xF0\x92\x86\xB7\xF0\x92...' for column 'entry' at row 1
  (entryname, entrycontent))

When I look at what actually got into the database with the mysql command-line client, I see the content truncated at the very first occurrence of a ~~non-UTF-8~~ four-byte UTF-8 character.

I don't care about preserving the ~~non-UTF-8~~ four-byte UTF-8 characters, so all I want to do is replace all ~~non-UTF-8~~ four-byte UTF-8 characters with some other valid UTF-8 character, so I can put the text into the database.

@Peter Wood: `'Cognates include Hittite ‎(lāman)'.decode().encode('ascii', 'replace')` produces `UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 25: ordinal not in range(128)` — davidrmcharles, Mar 28 '16 at 20:17
Sorry, `'Cognates include Hittite ‎(lāman)'.decode('utf-8').encode('ascii', 'replace')`, gives `'Cognates include Hittite ???????? ?(l?man)'` — Peter Wood, Mar 28 '16 at 21:41

Alastair McCormack · Answer 1 · 2016-03-28T22:04:58.810

4

You will need to set your table encoding to utf8mb4 to support 4 byte UTF-8 encoding - https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html

Also, the MySQL driver supports Unicode strings, so you should pass Unicode to free your code from encoding specifics:

E.g.

cursor.execute(u'update mytable set entryContent=%s where entryName=%s',
(entryContent.decode("utf-8"), entryName.decode("utf-8")))

Ideally, entryContent and entryName will have been decoded to Unicode earlier in your code when you first receive them. E.g. when opening a file or receiving from the network.

edited Mar 28 '16 at 22:04

answered Mar 28 '16 at 21:56

Alastair McCormack

26,573
8
77
100

It chokes on four-byte UTF-8 characters. Three-byte (and fewer) UTF-8 characters are fine. – davidrmcharles Mar 28 '16 at 21:57
Even when you pass Unicode (opposed to the str shown in your question)? – Alastair McCormack Mar 28 '16 at 21:58
Have you set your table to `utf8mb4` encoding? See: https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html – Alastair McCormack Mar 28 '16 at 22:00
No! That may be the true best answer! – davidrmcharles Mar 28 '16 at 22:06
1

Success! This answer was extremely helpful, but there was a little more work required. After reconfiguring the database, table, and column to use `utf8mb4` (instead of just `utf8`) with [this](https://mathiasbynens.be/notes/mysql-utf8mb4), I discovered that MySQLdb.connect() raised a `LookupError` when I requested a connection with `charset=utf8mb4`, but I could change this post-connection by configuring the cursor as described in [this answer](http://stackoverflow.com/a/27390024/1555925). – davidrmcharles Mar 28 '16 at 23:02

score 3 · Answer 2 · answered Mar 28 '16 at 21:41

Turns out the problem is not that I am feeding non-UTF-8 characters to MySQL, but that I am feeding four-byte UTF-8 characters to MySQL when it supports only three-(and fewer)byte UTF-8 characters (according to this documentation)

This solution retains all the supported UTF-8 characters, and converts the unsupported UTF-8 characters to '?':

>>> print ''.join([c if len(c.encode('utf-8')) < 4 else '?' for c in u'Cognates include Hittite  ‎(lāman)'])
Cognates include Hittite ???? ‎(lāman)

Notice that 'ā' is retained
Notice that '' has become '????'

I can put this string into MySQL without the above warnings (and undesirable truncation).

Yes, the real problem was that I did not know of `utf8mb4`. – davidrmcharles Mar 28 '16 at 23:27 — davidrmcharles, Mar 28 '16 at 23:27

score 2 · Answer 3 · edited May 23 '17 at 10:33

2

Could you use a regular expression to remove non-ascii characters? Using your example in the comments:

>>> entry = 'Cognates include Hittite  ‎(lāman)'
>>> entry = ''.join([char if ord(char) < 128 else '' for char in entry])
>>> print entry
Cognates include Hittite  (lman)

This is a slight variation of this answer to a different problem.

edited May 23 '17 at 10:33

Community

1
1

answered Mar 28 '16 at 21:08

cyril

2,976
5
19
26

I want to keep (three-byte) UTF-8 characters, but this answer removes many UTF-8 characters. – davidrmcharles Mar 28 '16 at 21:56

In Python, how to replace all non-UTF-8 characters in a string?

3 Answers3