
Look at the following:

/home/kinka/workspace/py/tutorial/tutorial/pipelines.py:33: Warning: Incorrect string 
value: '\xF0\x9F\x91\x8A\xF0\x9F...' for column 't_content' at row 1
n = self.cursor.execute(self.sql, (item['topic'], item['url'], item['content']))

The string '\xF0\x9F\x91\x8A' is actually the 4-byte UTF-8 encoding of a codepoint outside the Basic Multilingual Plane: u'\U0001f44a'. MySQL's character set is utf-8, but inserting a 4-byte sequence truncates the inserted string. I googled the problem and found that MySQL before 5.5.3 doesn't support 4-byte UTF-8, and unfortunately mine is 5.5.224. I don't want to upgrade the MySQL server, so I just want to filter out the 4-byte sequences in Python. I tried a regular expression but failed. Any help?

Kinka

3 Answers


If MySQL cannot handle 4-byte UTF-8 sequences then you'll have to filter out all Unicode characters at or above codepoint U+10000; UTF-8 encodes every codepoint below that threshold in 3 bytes or fewer.

You could use a regular expression for that:

>>> import re
>>> highpoints = re.compile(u'[\U00010000-\U0010ffff]')
>>> example = u'Some example text with a sleepy face: \U0001f62a'
>>> highpoints.sub(u'', example)
u'Some example text with a sleepy face: '

Alternatively, you could use the .translate() function with a mapping table that only contains None values:

>>> nohigh = { i: None for i in xrange(0x10000, 0x110000) }
>>> example.translate(nohigh)
u'Some example text with a sleepy face: '

However, creating the translation table will eat a lot of memory and take some time to generate; it is probably not worth your effort as the regular expression approach is more efficient.

This all presumes you are using a UCS-4 compiled python. If your python was compiled with UCS-2 support then you can only use codepoints up to '\U0000ffff' in regular expressions and you'll never run into this problem in the first place.
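On a narrow (UCS-2) build you can still strip these characters by matching the UTF-16 surrogate pairs they are stored as internally. A sketch that picks the appropriate pattern at runtime via `sys.maxunicode` (the surrogate-pair branch is an addition for illustration, not part of the original answer):

```python
import re
import sys

if sys.maxunicode > 0xffff:
    # wide (UCS-4) build: astral characters are single codepoints,
    # so the range in the character class works directly
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
else:
    # narrow (UCS-2) build: astral characters are stored as UTF-16
    # surrogate pairs, so match a high surrogate followed by a low one
    highpoints = re.compile(u'[\ud800-\udbff][\udc00-\udfff]')

example = u'Some example text with a sleepy face: \U0001f62a'
cleaned = highpoints.sub(u'', example)
```

Either way, `cleaned` ends up with the astral character removed, so the same code can run unmodified on both build types.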

I note that as of MySQL 5.5.3 the newly-added utf8mb4 codec does support the full Unicode range.

Martijn Pieters
  • I tried your code, but it doesn't work. It is `\U`(uppercase `u`). However, your thought is really enlightening, thanks! – Kinka May 29 '12 at 12:59
  • You are quite right; corrected to use proper 8 byte escape sequences. I had some trouble at first due to using a UCS2 compiled python :-P – Martijn Pieters May 29 '12 at 13:10
  • But in my case, it really matters whether it's lowercase or not. What I use is `highpoints = re.compile(u'[\U00010000-\U0001ffff]')` and it works. It seems specific to my machine (is it a problem with the Python version? mine is Python 2.7). With uppercase `\U`, the escape supports a much bigger range. – Kinka May 29 '12 at 13:12
  • No, you are quite correct. I've just lowered the lower bound as well, realizing I misread the UTF-8 table. – Martijn Pieters May 29 '12 at 13:12
  • 1
    Any idea why I'm getting an error with: `re.compile(u'[\U00010000-\U0010ffff]')` `"/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 244, in _compile` `raise error, v # invalid expression` `sre_constants.error: bad character range` – Michael Waterfall Sep 05 '12 at 18:33
  • @MichaelWaterfall: You have a UCS2 (2-byte unicode) compiled Python; it'll only support unicode values up to `\uffff`. – Martijn Pieters Sep 05 '12 at 18:38
  • Ah okay, thanks! Out of interest then, why does `print(u'\U0001f3b6')` display the correct emoji character (which is 4-bytes)? – Michael Waterfall Sep 05 '12 at 18:44
  • @MichaelWaterfall: I am not certain how this is all handled; a UCS2 build can print a 4-byte unicode string, but the regular expression module cannot handle it properly because the internal representation cannot handle it. – Martijn Pieters Sep 05 '12 at 19:27
  • @MichaelWaterfall: On a UCS-4 build you can compile the regular expression, on a UCS-2 build you cannot. Python 3.3 does away with the distinction, btw, so there is hope for the future. :-) – Martijn Pieters Sep 05 '12 at 19:28
  • @MichaelWaterfall: Do `import sys; print sys.maxunicode`. If you get `65535` it's a UCS-2 build, `1114111` for a wide UCS-4 build. – Martijn Pieters Sep 05 '12 at 19:30
  • @MichaelWaterfall: On a UCS-2 build, that character is actually *2* bytes, using a UTF-16 surrogate; `len(u'\U0001f3b6') == 2` on such a build. On a UCS-4 build, it's `len(u'\U0001f3b6') == 1`.. – Martijn Pieters Sep 05 '12 at 19:42
  • @MartijnPieters Thank you for the great info. I'd actually just noticed that `len(u'\U0001f3b6') == 2` and figured that was caused by the UCS-2 version of Python. Time to update! Thanks again. – Michael Waterfall Sep 07 '12 at 14:07

I think you should use the utf8mb4 character set instead of utf8 and run

SET NAMES utf8mb4

after connecting to the DB (link, link, link)

Dmitry
  • Setting the connection to use `utf8mb4` is the best approach, but you shouldn't do it with `SET NAMES`. This command changes the connection setting at the server end without letting the client library know about the change, which means that anything in the client library using the C `mysql_real_escape_string` API can get bad results. This can result in SQL injection security holes if an East Asian multibyte encoding is one or both of the charsets involved. Character sets should be set at connection time; in python-mysql this would be done with the `charset` argument to `connect()`. – bobince Jul 23 '15 at 11:38
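Following that comment, the character set is better supplied at connection time rather than via SET NAMES. A sketch using the `charset` argument of python-mysql's `connect()` (all connection parameters below are placeholders):

    import MySQLdb

    # charset='utf8mb4' informs both the server and the client library,
    # so mysql_real_escape_string works with the correct encoding
    conn = MySQLdb.connect(
        host='localhost',   # placeholder
        user='user',        # placeholder
        passwd='password',  # placeholder
        db='tutorial',      # placeholder
        charset='utf8mb4',
    )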

A simple normalization of the string, without regex or translate:

def normalize_unicode(s):
    # replace each character outside the BMP with U+FFFD REPLACEMENT CHARACTER
    # (the original joined the bare int 0xfffd, which raises a TypeError)
    return u''.join(c if ord(c) < 0x10000 else u'\ufffd' for c in s)
ChCh