1

I'm using Python to process sentences from Weibo (a Twitter-like service in China). Some sentences contain emoticons whose Unicode code points look like \ue317. To process a sentence, I need to encode it as GBK, as shown below:

 string1_gbk = string1.decode('utf-8').encode('gb2312')

This raises UnicodeEncodeError: 'gbk' codec can't encode character u'\ue317'

I tried the regex \\ue[0-9a-zA-Z]{3}, but it did not match anything. How can I match these emoticons in the sentences?

j0k
bitwjg
  • Is the data coming from Weibo in UTF-8 or in GB2312? Why can't you stick with the encoding of the data as given? – sarnold Jun 05 '12 at 00:54
  • The data from Weibo is encoded in UTF-8, but I need to process it with an open-source parser that can only handle sentences encoded in GBK, so I need to do the conversion. – bitwjg Jun 05 '12 at 01:02

3 Answers

4

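The six characters '\ue317' are not a substring of u"asdasd \ue317 asad". In that string, \ue317 is only the human-readable escape for a single Unicode character, so a regexp that looks for a literal backslash followed by ue317 cannot match it. Such a regexp only works on the escaped form, for example on repr(u'\ue317').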
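To match the character itself rather than its escape, one option is a character class over a code point range. The sketch below assumes the emoticons all fall in the Basic Multilingual Plane Private Use Area (U+E000 to U+F8FF), which includes U+E317; the question does not confirm that every Weibo emoticon does. The names PRIVATE_USE, find_private_use and strip_private_use are made up for this illustration.

```python
import re

# Assumption: the emoticons are code points in the BMP Private Use Area,
# U+E000 through U+F8FF. U+E317 from the question falls inside this range.
PRIVATE_USE = re.compile(u'[\ue000-\uf8ff]')


def find_private_use(text):
    """Return every private-use character found in a unicode string."""
    return PRIVATE_USE.findall(text)


def strip_private_use(text):
    """Return the text with private-use characters removed."""
    return PRIVATE_USE.sub(u'', text)


sample = u'asdasd \ue317 asad'
found = find_private_use(sample)      # the single character U+E317
cleaned = strip_private_use(sample)   # sample without the emoticon
```

The pattern has to be applied to the decoded unicode string, not to the raw UTF-8 bytes, because in the bytes the emoticon is a three-byte sequence rather than one character.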

dda
2

Try

string1_gbk = string1.decode('utf-8').encode('gb2312', 'replace')

This should output ? in place of each emoticon.
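As a rough illustration of how the error handler changes the result, the sketch below encodes the same text with 'replace' and with 'ignore'. The helper name to_gb2312 and the sample text are assumptions made for this example, not part of the original code.

```python
def to_gb2312(utf8_bytes, errors='replace'):
    """Decode UTF-8 bytes and re-encode them as GB2312.

    errors='replace' substitutes '?' for characters GB2312 cannot encode;
    errors='ignore' drops them silently.
    """
    return utf8_bytes.decode('utf-8').encode('gb2312', errors)


# Hypothetical sample: two Chinese characters followed by the emoticon U+E317.
raw = u'\u4f60\u597d \ue317'.encode('utf-8')

replaced = to_gb2312(raw, 'replace')  # emoticon becomes '?'
dropped = to_gb2312(raw, 'ignore')    # emoticon is removed
```

With 'replace' the parser still sees one placeholder per emoticon, while 'ignore' changes the sentence length; which is preferable depends on what the downstream parser does with a stray ?.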

Python Docs - Python Wiki

Nick ODell
1

It may be because the backslash is a special escape character in regexp syntax. The following worked for me:

>>> test_str = 'blah blah blah \ue317 blah blah \ueaa2 blah ue317'
>>> re.findall(r'\\ue[0-9A-Za-z]{3}', test_str)
['\\ue317', '\\ueaa2']

Notice it doesn't erroneously match the ue317 at the end, which has no preceding backslash. Obviously, use re.sub() if you wish to replace those character strings.
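The difference between the escaped text and the real character can be checked directly. This is a small sketch with made-up sample strings: the pattern from this answer matches the literal backslash form, finds nothing in a string holding the actual character, and matches again once that string is converted with the standard unicode_escape codec.

```python
import re

# Pattern from the answer above: a literal backslash, "ue", then three
# alphanumerics.
ESCAPED = re.compile(r'\\ue[0-9A-Za-z]{3}')

# Raw string, so the backslashes are literal characters in the text.
escaped_text = r'blah \ue317 blah \ueaa2 blah ue317'

# Unicode string holding the real character U+E317 (no backslash in it).
real_text = u'blah \ue317 blah'

in_escaped = ESCAPED.findall(escaped_text)  # both backslash sequences
in_real = ESCAPED.findall(real_text)        # nothing to match

# The unicode_escape codec turns the real character back into its
# backslash-escaped spelling, which the pattern can match again.
as_escaped = real_text.encode('unicode_escape').decode('ascii')
in_converted = ESCAPED.findall(as_escaped)
```

Escaping and then matching is a roundabout route; it mainly shows why the pattern behaves differently on the two kinds of input.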

Greg E.
  • I tried this, but it had no effect on a Chinese sentence encoded in UTF-8, and I do not know why. – bitwjg Jun 05 '12 at 01:13
  • It's because `test_str` contains the literal text `\ue317`, not the unicode character often represented by `\ue317` – Nick ODell Jun 05 '12 at 01:18
  • @NickODell, I've never had to deal with language encoding issues, but, if he were to attempt to replace those strings representing unicode characters using regexps (which is obviously sub-optimal given your answer above), would he first have to convert into something like iso-8859-1 in order to render them in that backslash-escaped format? – Greg E. Jun 05 '12 at 01:21
  • @Greg, Looks like [asyntax](http://stackoverflow.com/a/10890297/530160) tried to answer you, but doesn't have the comment everywhere privilege yet. – Nick ODell Jun 05 '12 at 01:25