How to build a regular vocabulary of emoticons in python?

Question

I have a list of emoticon codes in a plain-text file, UTF32.red.codes. The content of the file is:

\U0001F600
\U0001F601
\U0001F602
\U0001F603 
\U0001F604
\U0001F605
\U0001F606
\U0001F609
\U0001F60A
\U0001F60B

Based on a related question, my idea is to build a regular expression from the contents of the file in order to catch emoticons. This is my minimal working example:

import re

with open('UTF32.red.codes','r') as emof:
   codes = [emo.strip() for emo in emof]
   emojis = re.compile(u"(%s)" % "|".join(codes))

string = u'string to check \U0001F601'
found = emojis.findall(string)

print found

found is always empty. Where am I going wrong? I am using Python 2.7.

Where's `string to check ` in your file? That's not supposed to be in `string`, I presume. Also, naming a variable `string` can be confusing, so you may want to refrain from doing so. — Nelewout, Jan 08 '16 at 16:13
Then do `string = u'\U0001F601'`. Even better, use a different variable name, like `search` or something similar. — Nelewout, Jan 08 '16 at 16:17
Do you run into any errors? I think we need more information if we want to solve this problem. — Nelewout, Jan 08 '16 at 16:35
No errors. Simply an empty list at the end of the script. I am using a MacBook Pro. — emanuele, Jan 08 '16 at 16:38
More important is which version of Python you are using: a string in Python 3 is always Unicode, but not in Python 2. — Steve Barnes, Jan 09 '16 at 05:57

score 1 · Answer 1 · edited May 23 '17 at 12:03

Your code works fine in Python 3 (just change print found to print(found)). In Python 2.7, however, it won't work, because its re module has a known bug (see this thread and this issue).
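As a rough illustration of the Python 3 behaviour, here is a minimal sketch, assuming Python 3.3 or later, where the re module itself interprets \UXXXXXXXX escapes inside a pattern. The in-memory codes list is a hypothetical stand-in for the lines of UTF32.red.codes, so the snippet runs without the file.

```python
import re

# Hypothetical stand-in for the stripped lines of UTF32.red.codes: each entry
# is the literal 10-character text \U0001F600, not the emoji character itself.
codes = [r"\U0001F600", r"\U0001F601", r"\U0001F602"]

# Assumption: Python 3.3+, where re understands \UXXXXXXXX escapes in patterns.
emojis = re.compile("(%s)" % "|".join(codes))

search = "string to check \U0001F601"
found = emojis.findall(search)
print(found)
```

Under those assumptions, found should contain the single emoji from search.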

If you still need a Python 2 version of the code, use the regex module, which can be installed with pip2 install regex. Import it with import regex, then replace every re. call with regex. (i.e. regex.compile and regex.findall). That should be all it takes.

answered Jan 08 '16 at 17:21 by vrs · edited May 23 '17 at 12:03 by Community

Why do you think the bug is related to this problem? – tripleee Jan 11 '16 at 10:32

emanuele · Accepted Answer · score 0 · 2016-01-11T13:31:42.827

This code works with python 2.7

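import re
# Read the file as bytes so each line can be decoded explicitly.
with open('UTF32.red.codes','rb') as emof:
    # 'unicode-escape' turns the literal text \U0001F600 into the actual character.
    codes = [emo.decode('unicode-escape').strip() for emo in emof]
    # Escape each code before joining the alternatives into one pattern.
    emojis = re.compile(u"(%s)" % "|".join(map(re.escape,codes)))

# In Python 2, a ur'' literal still processes \U escapes.
search = ur'string to check \U0001F601'
found = emojis.findall(search)

print found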
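For readers on Python 3, the same idea can be sketched as follows. This is only an illustrative port, not the author's code: io.BytesIO is a hypothetical stand-in for open('UTF32.red.codes', 'rb'), and the sample bytes are assumed to mirror the file's layout (one escape sequence per line, possibly with trailing whitespace).

```python
import io
import re

# Hypothetical stand-in for open('UTF32.red.codes', 'rb'): the file stores the
# escape sequences as plain ASCII text, one per line.
emof = io.BytesIO(b"\\U0001F600\n\\U0001F601 \n\\U0001F602\n")

# Decode each escape sequence into the real character, then strip whitespace.
codes = [line.decode("unicode-escape").strip() for line in emof]

# re.escape is a precaution in case a decoded character is a regex metacharacter.
emojis = re.compile("(%s)" % "|".join(map(re.escape, codes)))

found = emojis.findall("string to check \U0001F601")
print(found)
```

The key step is the decode: before it, each line is ten ASCII characters of text; after it, each entry in codes is the emoji character itself, which is what the pattern has to contain in order to match.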

answered Jan 11 '16 at 09:17 by emanuele · edited Jan 11 '16 at 13:31
