Extract emoticons from a text

Question

I need to extract text emoticons from a text using Python. I've been looking for solutions, but most of them, like this or this, only cover simple emoticons. I need to parse all of them.

Currently I'm using a list of emoticons that I iterate over for every text I have to process, but this is very inefficient. Do you know a better solution? Maybe a Python library that can handle this problem?

It might take a lot of time, but that doesn't mean it's slow. — Peter Wood, May 21 '15 at 10:39
@iced Why not? It is likely to be optimised far better than most of what you could do by hand. Try it and see whether it's fast enough. — Peter Wood, May 21 '15 at 12:42
@PeterWood there are better algorithms (see answer) and regexps are the wrong solution in 99.83% of cases. — iced, May 22 '15 at 14:19
@iced There may be faster algorithms, but the regex might be good enough, and is easy to implement and try out. — Peter Wood, May 22 '15 at 15:48
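
For reference, the regex approach discussed in these comments could look roughly like the sketch below. It is only an illustration: the function name `compile_emoticon_pattern` and the emoticon list are made up for this example. The two details that matter for emoticons are escaping (characters such as `(`, `)` and `|` are regex metacharacters) and trying longer emoticons first, so that `:-))` is not cut short to `:-)`.

```python
import re

def compile_emoticon_pattern(emoticons):
    """Build one alternation regex from a list of emoticons (hypothetical helper)."""
    # Longest first, so ":-))" is tried before its prefix ":-)".
    ordered = sorted(set(emoticons), key=len, reverse=True)
    # re.escape turns "(", ")", "|" and similar characters into literals.
    return re.compile("|".join(re.escape(e) for e in ordered))

# Example emoticon list, invented for illustration.
pattern = compile_emoticon_pattern([":)", ":-)", ":(", ";)", ":-))"])
found = [(m.start(), m.group()) for m in pattern.finditer("hi :-)) ok :( ;)")]
print(found)
```

Each entry in `found` is a (start index, emoticon) pair. Matches reported this way never overlap, which is usually what you want for emoticons.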

Luka Rahne · Accepted Answer · 2015-05-21T17:02:27.463

One of the most efficient solutions is to use the Aho–Corasick string matching algorithm, a nontrivial algorithm designed for exactly this kind of problem: searching for multiple predefined strings in an unknown text.

There is a package available for this:
https://pypi.python.org/pypi/ahocorasick/0.9
https://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

Edit: There are also more recent packages available (I haven't tried any of them): https://pypi.python.org/pypi/pyahocorasick/1.0.0
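
To give an idea of what the algorithm does, here is a minimal pure-Python sketch of Aho–Corasick. It is not the code of any of the packages linked above, and the names `build_automaton` and `find_all` as well as the emoticon list are invented for this illustration. The patterns are stored in a trie, each trie node gets a failure link to the longest proper suffix that is also in the trie, and the text is then scanned once, following failure links instead of restarting.

```python
from collections import deque

def build_automaton(patterns):
    """Build the trie, failure links and output lists (illustrative sketch)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]    # out[state] -> patterns that end when this state is reached
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first, so shallower states are finished before deeper ones use them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the suffix state as well.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence, overlapping ones included."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for index, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            matches.append((index - len(pattern) + 1, pattern))
    return matches

# Example emoticon list, invented for illustration.
automaton = build_automaton([":)", ":-)", ":(", ";)", ":-))"])
matches = find_all("hi :-)) ok :( ;)", automaton)
print(matches)
```

Note that both `:-)` and `:-))` are reported at the same position, because the algorithm finds every occurrence. If only the longest emoticon at each position is wanted, the results have to be filtered afterwards. For real workloads a compiled package will most likely be much faster than this sketch.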

Extra:
I ran a small performance test with pyahocorasick, and it is faster than Python's re when searching for more than one word from the dictionary (2 or more).

Here is the code:

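import random
import re
import time

import ahocorasick  # module provided by the pyahocorasick package

# Number of words to sample from the dictionary and search for
N = 3

# Text file from http://norvig.com/big.txt
with open("big.txt", "r") as f:
    text = f.read()

words = set(re.findall('[a-z]+', text.lower()))
search_words = random.sample(list(words), N)

# Build the Aho-Corasick automaton (not included in the timing)
A = ahocorasick.Automaton()
for i, w in enumerate(search_words):
    A.add_word(w, (i, w))
A.make_automaton()

# Time the Aho-Corasick search
start = time.perf_counter()
print("aho matches", sum(1 for _ in A.iter(text)))
print("aho done in", time.perf_counter() - start)

# Compile the regex (not included in the timing); escape each word to be safe
exp = re.compile('|'.join(re.escape(w) for w in search_words))

# Time the regex search. The counts can differ: A.iter also reports
# overlapping matches, while findall only returns non-overlapping ones.
start = time.perf_counter()
m = exp.findall(text)
print("re matches", len(m))
print("re done in", time.perf_counter() - start)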

I've been reading about it and seems efficient enough. I'm going to give it a try. Thank you. — David Moreno García, May 21 '15 at 10:52
What pyahocorasick doesn't do is return the start index of the match (just the end). I implemented it myself and it is working really well. Thanks again for your answer. — David Moreno García, May 22 '15 at 18:07
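
The start index mentioned in the last comment can be recovered from the end index when the matched word is stored as the value, as the answer's code does with `A.add_word(w, (i, w))`. The sketch below is not the commenter's implementation. The helper name `to_spans` is hypothetical, and the sample pairs are written by hand in the `(end_index, (i, word))` shape that `A.iter(text)` yields for that setup, so the snippet runs without the package installed.

```python
def to_spans(matches):
    """Turn (end_index, (key, word)) pairs into (start, end, word) tuples.

    Both indices are inclusive, so text[start:end + 1] == word.
    """
    return [(end - len(word) + 1, end, word) for end, (_key, word) in matches]

# Hand-written sample for the text "hi :-) ok :(" (assumed input, not library output).
text = "hi :-) ok :("
sample = [(5, (0, ":-)")), (11, (1, ":("))]
spans = to_spans(sample)
print(spans)
```

With a real automaton, `to_spans(A.iter(text))` would be the equivalent call, assuming the values were added in the same `(i, word)` form.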
