How to find and count emoticons in a string using python?

Question

This topic has been addressed for text based emoticons at link1, link2, link3. However, I would like to do something slightly different than matching simple emoticons. I'm sorting through tweets that contain the emoticons' icons. The following unicode information contains just such emoticons: pdf.

Given a string of English words that also contains any of the emoticons from the pdf, I would like to compare the number of emoticons to the number of words.

The direction that I was heading down doesn't seem to be the best option and I was looking for some help. As you can see in the script below, I was just planning to do the work from the command line:

$cat <file containing the strings with emoticons> | ./emo.py

emo.py pseudo script:

import re
import sys

for row in sys.stdin:
    print row.decode('utf-8').encode("ascii","replace")
    #insert regex to find the emoticons
    if match:
       #do some counting using .split(" ")
       #print the counting

The problem that I'm running into is the decoding/encoding. I haven't found a good option for how to encode/decode the string so I can correctly find the icons. An example of the string that I want to search to find the number of words and emoticons is as follows:

"Smiley emoticon rocks! enter image description here I like you."

The challenge: can you make a script that counts the number of words and emoticons in this string? Notice that the emoticons are both sitting next to the words with no space in between.

This is all pretty basic regex stuff, so… have you read the [Regular Expression HOWTO](http://docs.python.org/3.3/howto/regex.html#regex-howto), or, better, a third-party tutorial? — abarnert, Oct 03 '13 at 01:57

abarnert · Accepted Answer · 2014-08-21T22:13:46.020

First, there is no need to encode here at all. You've got a Unicode string, and the re engine can handle Unicode, so just use it.

A character class can include a range of characters, by specifying the first and last with a hyphen in between. And you can specify Unicode characters that you don't know how to type with \U escape sequences. So:

import re

s=u"Smiley emoticon rocks!\U0001f600 I like you.\U0001f601"
count = len(re.findall(ur'[\U0001f600-\U0001f650]', s))

Or, if the string is big enough that building up the whole findall list seems wasteful:

emoticons = re.finditer(ur'[\U0001f600-\U0001f650]', s)
count = sum(1 for _ in emoticons)

Counting words, you can do separately:

wordcount = len(s.split())

If you want to do it all at once, you can use an alternation group:

word_and_emoticon_count = len(re.findall(ur'\w+|[\U0001f600-\U0001f650]', s))
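
On Python 3.3 or later every build handles the full Unicode range and the `u` prefix is unnecessary, so the same idea can be sketched as follows. The function name `count_words_and_emoticons` is made up for illustration, and the range is simply the one used above, which is an assumption about which code points count as emoticons.

```python
import re

# Assumed emoticon range, copied from the answer above (U+1F600..U+1F650).
# Since Python 3.3 the re module itself understands \U escapes in raw patterns.
EMOTICON = re.compile(r'[\U0001f600-\U0001f650]')
WORD = re.compile(r'\w+')


def count_words_and_emoticons(text):
    """Return (word_count, emoticon_count) for a str."""
    return len(WORD.findall(text)), len(EMOTICON.findall(text))


s = "Smiley emoticon rocks!\U0001f600 I like you.\U0001f601"
words, emoticons = count_words_and_emoticons(s)
print(words, emoticons)  # 6 2
```

Counting words with `\w+` rather than `s.split()` keeps an emoticon that touches a word from being swallowed into that word's token.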

As @strangefeatures points out, Python versions before 3.3 allowed "narrow Unicode" builds. And, for example, most CPython Windows builds are narrow. In narrow builds, characters can only be in the range U+0000 to U+FFFF. There's no way to search for these characters, but that's OK, because they don't exist to search for; you can just assume they don't exist if you get an "invalid range" error compiling the regexp.

Except, of course, that there's a good chance that wherever you're getting your actual strings from, they're UTF-16-BE or UTF-16-LE, so the characters do exist, they're just encoded into surrogate pairs. And you want to match those surrogate pairs, right? So you need to translate your search into a surrogate-pair search. That is, convert your high and low code points into surrogate pair code units, then (in Python terms) search for:

(lead == low_lead and lead != high_lead and low_trail <= trail <= DFFF or
 lead == high_lead and lead != low_lead and DC00 <= trail <= high_trail or
 low_lead < lead < high_lead and DC00 <= trail <= DFFF)

You can leave off the second condition in the last case if you're not worried about accepting bogus UTF-16.

If it's not obvious how that translates into regexp, here's an example for the range [\U0001e050-\U0001fbbf] in UTF-16-BE:

(\ud838[\udc50-\udfff])|([\ud839-\ud83d].)|(\ud83e[\udc00-\udfbf])

Of course if your range is small enough that low_lead == high_lead this gets simpler. For example, the original question's range can be searched with:

\ud83d[\ude00-\ude50]
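
The code-unit arithmetic behind these patterns is short enough to sketch. The helper names `to_surrogates`, `unit_escape` and `surrogate_range_pattern` below are hypothetical, not part of any library. The sketch only builds the pattern text, and it assumes both ends of the range lie above U+FFFF.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into (lead, trail) UTF-16 code units."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def unit_escape(unit):
    """Format one code unit as a \\uXXXX escape, as text."""
    return '\\u%04x' % unit


def surrogate_range_pattern(low, high):
    """Build regex text matching surrogate pairs for the range low..high."""
    low_lead, low_trail = to_surrogates(low)
    high_lead, high_trail = to_surrogates(high)
    if low_lead == high_lead:
        # Simple case: a single lead unit followed by a range of trail units.
        return '%s[%s-%s]' % (unit_escape(low_lead),
                              unit_escape(low_trail), unit_escape(high_trail))
    parts = ['%s[%s-%s]' % (unit_escape(low_lead),
                            unit_escape(low_trail), unit_escape(0xDFFF))]
    if high_lead - low_lead > 1:
        # Lead units strictly between the two ends accept any trail unit.
        parts.append('[%s-%s][%s-%s]' % (
            unit_escape(low_lead + 1), unit_escape(high_lead - 1),
            unit_escape(0xDC00), unit_escape(0xDFFF)))
    parts.append('%s[%s-%s]' % (unit_escape(high_lead),
                                unit_escape(0xDC00), unit_escape(high_trail)))
    return '|'.join('(%s)' % part for part in parts)


print(surrogate_range_pattern(0x1F600, 0x1F650))
print(surrogate_range_pattern(0x1E050, 0x1FBBF))
```

The first call reproduces the single-lead pattern shown above. The second gives the three-way alternation for the larger range, using an explicit trail class in the middle group instead of `.`.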

One last trick, if you don't actually know whether you're going to get UTF-16-LE or UTF-16-BE (and the BOM is far away from the data you're searching): Because no surrogate lead or trail code unit is valid as a standalone character or as the other end of a pair, you can just search in both directions:

(\ud838[\udc50-\udfff])|([\ud839-\ud83d][\udc00-\udfff])|(\ud83e[\udc00-\udfbf])|
([\udc50-\udfff]\ud838)|([\udc00-\udfff][\ud839-\ud83d])|([\udc00-\udfbf]\ud83e)

I had to change your r' to u' as re.findall(u'[\U0001f600-\U0001f650]', s.decode('utf-8')), which then correctly finds the emoticons. Thanks @abarnert! — blehman, Oct 03 '13 at 16:21
@simplyclimb: Yeah, you need the `u'…'`, and the `s` variable should _also_ be a unicode string. (For some reason, I assumed you were using Python 3, but looking at your question, it's obviously 2.x.) But you still want the `r`. In this case, dropping it happens to not matter, because a Python string literal will interpret the escape sequence `\U0001f600` the exact same way as the `re` engine would… But it's a good idea to always use raw strings for regexps unless you have a specific reason not to, instead of studying each regexp to figure out whether you need a raw string or not. — abarnert, Oct 03 '13 at 18:19
The re `ur'[\U0001f600-\U0001f650]'` fails to compile on some Python builds less than 3.3 (I think narrow builds - ie `sys.maxunicode == 0xffff`), with a "bad character range" error. — Andy MacKinlay, Aug 21 '14 at 03:24
@strangefeatures: Yeah, IIRC that's considered a "wontfix" bug in the `re` library because the 3.3 Unicode changes made it irrelevant, and because there's no easy fix with UTF-16. I'll update the answer to explain, but no one will like the solution… — abarnert, Aug 21 '14 at 21:48
@strangefeatures: (Also, I believe we were supposed to be getting `regex` to replace `re` in 2.7/3.2, which could have handled this for you, but then it got deferred to 3.3, then 3.4, and then indefinitely…) — abarnert, Aug 21 '14 at 22:17
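
The narrow-build check mentioned in these comments can be sketched as below. This is only an illustration: `emoticon_regex` is a hypothetical helper, the range is the one from the answer, and on Python 3.3 or later the first branch is always taken because `sys.maxunicode` is always `0x10FFFF`.

```python
import re
import sys


def emoticon_regex():
    """Compile a pattern for U+1F600..U+1F650 that suits the running build."""
    if sys.maxunicode > 0xFFFF:
        # Wide build: astral characters are single code points.
        return re.compile(u'[\U0001f600-\U0001f650]')
    # Narrow build (possible before Python 3.3): strings hold UTF-16 code
    # units, so match the surrogate pair instead of a character range.
    return re.compile(u'\ud83d[\ude00-\ude50]')


found = emoticon_regex().findall(u"a\U0001f600b\U0001f601")
print(len(found))  # 2
```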

score 5 · Answer 2 · answered Mar 12 '18 at 18:51

My solution includes the emoji and regex modules. The regex module supports recognizing grapheme clusters (sequences of Unicode codepoints rendered as a single character), so we can count emojis like 👨‍👩‍👦‍👦 once, although it consists of 4 emojis.

import emoji
import regex

def split_count(text):
    emoji_counter = 0
    data = regex.findall(r'\X', text)
    for word in data:
        if any(char in emoji.UNICODE_EMOJI for char in word):
            emoji_counter += 1
            # Remove from the given text the emojis
            text = text.replace(word, '') 

    words_counter = len(text.split())

    return emoji_counter, words_counter

Testing:

line = "hello 👩🏾‍🎓 emoji hello 👨‍👩‍👦‍👦 how are 😊 you today🙅🏽🙅🏽"
counter = split_count(line)
print("Number of emojis - {}, number of words - {}".format(counter[0], counter[1]))

Output:

Number of emojis - 5, number of words - 7

sheldonzy

`UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal if any(char in emoji.UNICODE_EMOJI for char in word):` error – kingmakerking May 31 '18 at 13:48
Easier: `for word in data: if emoji.is_emoji(word): ...` – Emiel Nov 16 '21 at 14:56
Even better, you can collapse the whole for loop: `emoji_counter = sum(emoji.is_emoji(word) for word in data)`. If you really need to count words too, like OP, you subtract `emoji_counter` from `word_counter`, no need to do multiple replacements. – Robin De Schepper Apr 06 '22 at 20:14
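
For readers without the third-party `emoji` and `regex` packages, a rough standard-library approximation of the same idea is sketched below. It is not equivalent to `\X` or to the `emoji` package's data: the ranges in `BASE` are a deliberately small assumption, and `count_emoji_and_words` and the test string are made up for illustration.

```python
import re

# Assumption: a small set of ranges, not the full Unicode emoji list.
BASE = '[\U0001f300-\U0001faff\u2600-\u27bf]'
# Optional skin-tone modifier or variation selector after a base character.
SUFFIX = '[\U0001f3fb-\U0001f3ff\ufe0f]?'
UNIT = BASE + SUFFIX
# Units glued together by ZERO WIDTH JOINER (U+200D) count as one emoji.
CLUSTER = re.compile(UNIT + '(?:\u200d' + UNIT + ')*')


def count_emoji_and_words(text):
    """Return (emoji_count, word_count), treating ZWJ sequences as one emoji."""
    emoji_count = len(CLUSTER.findall(text))
    # Replace each cluster with a space so neighbouring words stay separate.
    word_count = len(CLUSTER.sub(' ', text).split())
    return emoji_count, word_count


# Hypothetical test string: a ZWJ sequence with a skin tone, a four-person
# ZWJ sequence, a plain emoji, and two skin-toned emoji glued to a word.
line = ("hello \U0001f469\U0001f3fe\u200d\U0001f393 emoji hello "
        "\U0001f468\u200d\U0001f469\u200d\U0001f466\u200d\U0001f466 "
        "how are \U0001f60a you today\U0001f645\U0001f3fd\U0001f645\U0001f3fd")
print(count_emoji_and_words(line))  # (5, 7)
```

Substituting a space for each cluster, rather than subtracting the emoji count from a `split()` count, keeps the word count right when an emoji touches a word.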

score 0 · Answer 3 · answered Oct 03 '13 at 01:16

If you are trying to read unicode characters outside the ascii range, don't convert into the ascii range. Just leave it as unicode and work from there (untested):

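import sys

count = 0
# Code points U+1F600 through U+1F650, inclusive
emoticons = set(range(0x1f600, 0x1f650 + 1))
for row in sys.stdin:
    # Python 2 reads bytes from stdin; decode so each char is a code point
    for char in row.decode('utf-8'):
        if ord(char) in emoticons:
            count += 1
print "%d emoticons found" % count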

Not the best solution, but it should work.
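
A Python 3 sketch of the same code-point scan is shown below. On Python 3, text-mode input is already decoded to `str`, so no explicit decoding is needed. The function name `count_emoticons` and the sample lines are assumptions for illustration, and passing `sys.stdin` in place of `sample` should work the same way.

```python
def count_emoticons(lines, low=0x1F600, high=0x1F650):
    """Count characters whose code point lies in [low, high], inclusive."""
    return sum(low <= ord(char) <= high for line in lines for char in line)


sample = ["Smiley emoticon rocks!\U0001f600\n", "I like you.\U0001f601\n"]
print("%d emoticons found" % count_emoticons(sample))  # 2 emoticons found
```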

score -2 · Answer 4 · answered Feb 08 '22 at 04:04

This is my solution using re:

import re
text = "your text with emojis"
em_count = len(re.findall(r'[^\w\s,.]', text))
print(em_count)

AKMalkadi

Could you explain your code only answer? – Robin De Schepper Apr 06 '22 at 19:48
Your code is non-obvious: how does the regex pattern work, what does `findall` do (link to the docs). That's the bare minimum you could do to explain your answer – Robin De Schepper Apr 07 '22 at 20:16
We are not learning Python here. There are enough sources and official docs explaining those functions. Also, there are docs that teach you regular expressions – AKMalkadi Apr 08 '22 at 21:05
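
For readers who want the explanation requested above, here is a small sketch of what the pattern does. `[^\w\s,.]` matches any single character that is not a word character, whitespace, a comma, or a period, and `re.findall` returns every non-overlapping match as a list. The sample string is borrowed from the accepted answer and is only an illustration.

```python
import re

pattern = re.compile(r'[^\w\s,.]')
text = "Smiley emoticon rocks!\U0001f600 I like you.\U0001f601"

matches = pattern.findall(text)
print(len(matches))  # 3: the "!" plus the two emoticons
```

Because every other punctuation mark or symbol also matches, the count only equals the emoji count when the text contains no punctuation besides commas and periods.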
