How to match an emoticon in a sentence with regular expressions

Question

I'm using Python to process sentences from Weibo (a Twitter-like service in China). Some sentences contain emoticons whose Unicode code points are \ue317 and the like. To process a sentence, I need to encode it as GBK, as shown below:

 string1_gbk = string1.decode('utf-8').encode('gb2312')

This raises a UnicodeEncodeError: 'gbk' codec can't encode character u'\ue317'

I tried the pattern \\ue[0-9a-zA-Z]{3}, but it did not work. How can I match these emoticons in sentences?

Is the data coming from Weibo in UTF-8 or in GB2312? Why can't you stick with the encoding of the data as given? – sarnold, Jun 05 '12 at 00:54
The data from Weibo is encoded in UTF-8, but I need to process it with an open-source parser that can only handle sentences encoded in GBK, so I need to do the conversion. – bitwjg, Jun 05 '12 at 01:02

score 4 · Answer 1 · edited Jun 05 '12 at 06:17

'\ue317' is not a substring of u"asdasd \ue317 asad". It is the human-readable escape notation for a single Unicode character, so a regexp that looks for a backslash followed by "ue317" cannot match it. Such a regexp only works on the escaped text, e.g. repr(u'\ue317').
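To make the distinction concrete, here is a minimal sketch that contrasts the escaped text with the actual character and then matches the character directly with a character class. It assumes the emoticons are code points in the Basic Multilingual Plane's Private Use Area (U+E000 to U+F8FF), which is where U+E317 falls; the names `ESCAPED_RE` and `PUA_RE` are made up for illustration, and the range may need adjusting for other emoticon sets.

```python
import re

text = u'asdasd \ue317 asad'     # contains the single code point U+E317
escaped = r'asdasd \ue317 asad'  # contains six literal characters: \ u e 3 1 7

# The pattern from the question: it looks for a literal backslash
# followed by "ue" and three alphanumeric characters.
ESCAPED_RE = re.compile(r'\\ue[0-9a-zA-Z]{3}')

# Assumption: the emoticons are Private Use Area code points
# (U+E000 to U+F8FF). This class matches the characters themselves.
PUA_RE = re.compile(u'[\ue000-\uf8ff]')

escaped_hits_in_text = ESCAPED_RE.findall(text)        # no backslash in text
escaped_hits_in_escaped = ESCAPED_RE.findall(escaped)  # matches the escape text
pua_hits_in_text = PUA_RE.findall(text)                # matches the character
cleaned = PUA_RE.sub(u'', text)                        # character removed
```

With the matched characters removed, `cleaned` no longer contains anything in the Private Use Area, so that particular character will not trip up a later GBK or GB2312 encode.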

edited Jun 05 '12 at 06:17

dda

answered Jun 05 '12 at 01:22

Aleksei astynax Pirogov

Nick ODell · Accepted Answer · 2012-06-05T01:30:18.037

2

Try

string1_gbk = string1.decode('utf-8').encode('gb2312', 'replace')

This should output ? in place of those emoticons.
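As a rough illustration of the error handlers, the sketch below encodes the same text with 'replace' and with 'ignore'. The sample sentence and the variable names are made up for the example; the UTF-8 bytes stand in for data received from Weibo.

```python
# Simulated UTF-8 input: two Chinese characters, one emoticon (U+E317),
# and some ASCII text. The sample content is hypothetical.
raw_utf8 = u'\u4f60\u597d \ue317 ok'.encode('utf-8')

text = raw_utf8.decode('utf-8')

# 'replace' substitutes '?' for every character the codec cannot encode.
replaced = text.encode('gb2312', 'replace')

# 'ignore' silently drops every character the codec cannot encode.
ignored = text.encode('gb2312', 'ignore')

# Without an error handler the default is 'strict', which raises
# UnicodeEncodeError on the emoticon.
try:
    text.encode('gb2312')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Both handlers apply to any unencodable character, not only emoticons, so 'ignore' can drop other text without warning.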

Python Docs - Python Wiki

edited Jun 05 '12 at 01:30

answered Jun 05 '12 at 00:55

Nick ODell

Thanks, it works; it replaces these emoticons with a question mark. – bitwjg Jun 05 '12 at 01:11
I used 'ignore' instead, which meets my requirement. Thank you. – bitwjg Jun 05 '12 at 01:17

score 1 · Answer 3 · answered Jun 05 '12 at 00:57

It may be because the backslash is a special escape character in regexp syntax. The following worked for me:

>>> import re
>>> # Raw string: the backslashes are literal characters, not escapes.
>>> test_str = r'blah blah blah \ue317 blah blah \ueaa2 blah ue317'
>>> re.findall(r'\\ue[0-9A-Za-z]{3}', test_str)
['\\ue317', '\\ueaa2']

Notice it doesn't erroneously match the ue317 at the end, which has no preceding backslash. Obviously, use re.sub() if you wish to replace those character strings.
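For the replacement case, a minimal sketch with re.sub() might look like the following. It works only on text that contains the literal backslash escapes, as in the test string above, not on the Unicode characters themselves. The name `ESCAPE_RE` is hypothetical, and narrowing the character class to hexadecimal digits is my own tightening of the pattern rather than part of the original.

```python
import re

# Literal backslash, "ue", then three hexadecimal digits.
# Assumption: restricting to hex digits is safe for \uXXXX escapes.
ESCAPE_RE = re.compile(r'\\ue[0-9A-Fa-f]{3}')

# Raw string so the backslashes stay literal characters.
test_str = r'blah blah blah \ue317 blah blah \ueaa2 blah ue317'

# Remove every escaped sequence; the bare "ue317" at the end is untouched.
stripped = ESCAPE_RE.sub('', test_str)
```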

answered Jun 05 '12 at 00:57

Greg E.

I tried this, but it had no effect on a Chinese sentence encoded in UTF-8. I do not know why. – bitwjg Jun 05 '12 at 01:13
It's because `test_str` contains `\ue317`, not the unicode character often represented by `\ue317` – Nick ODell Jun 05 '12 at 01:18
@NickODell, I've never had to deal with language encoding issues, but, if he were to attempt to replace those strings representing unicode characters using regexps (which is obviously sub-optimal given your answer above), would he first have to convert into something like iso-8859-1 in order to render them in that backslash-escaped format? – Greg E. Jun 05 '12 at 01:21
@Greg, Looks like [asyntax](http://stackoverflow.com/a/10890297/530160) tried to answer you, but doesn't have the comment everywhere privilege yet. – Nick ODell Jun 05 '12 at 01:25
