13

This topic has been addressed for text-based emoticons at link1, link2, link3. However, I would like to do something slightly different from matching simple emoticons. I'm sorting through tweets that contain the emoticons' icons. The following Unicode information contains just such emoticons: pdf.

Using a string with English words that also contains any of these emoticons from the pdf, I would like to be able to compare the number of emoticons to the number of words.

The direction I was heading in doesn't seem to be the best option, so I'm looking for some help. As you can see in the script below, I was just planning to do the work from the command line:

$ cat <file containing the strings with emoticons> | ./emo.py

emo.py pseudo-script:

import re
import sys

for row in sys.stdin:
    print row.decode('utf-8').encode("ascii","replace")
    #insert regex to find the emoticons
    if match:
       #do some counting using .split(" ")
       #print the counting

The problem that I'm running into is the decoding/encoding. I haven't found a good option for how to encode/decode the string so I can correctly find the icons. An example of the string that I want to search to find the number of words and emoticons is as follows:

"Smiley emoticon rocks![emoticon] I like you[emoticon]."

The challenge: can you make a script that counts the number of words and emoticons in this string? Notice that the emoticons are both sitting next to the words with no space in between.

blehman
  • Is using regexp a requirement here? – abarnert Oct 03 '13 at 01:48
  • This is all pretty basic regex stuff, so… have you read the [Regular Expression HOWTO](http://docs.python.org/3.3/howto/regex.html#regex-howto), or, better, a third-party tutorial? – abarnert Oct 03 '13 at 01:57

4 Answers

19

First, there is no need to encode here at all. You've got a Unicode string, and the re engine can handle Unicode, so just use it.

A character class can include a range of characters, by specifying the first and last with a hyphen in between. And you can specify Unicode characters that you don't know how to type with \U escape sequences. So:

import re

s = u"Smiley emoticon rocks!\U0001f600 I like you.\U0001f601"
# Python 2 raw Unicode literals are spelled ur'...'; ru'...' is a syntax error
count = len(re.findall(ur'[\U0001f600-\U0001f650]', s))

Or, if the string is big enough that building up the whole findall list seems wasteful:

emoticons = re.finditer(ur'[\U0001f600-\U0001f650]', s)
count = sum(1 for _ in emoticons)

Counting words, you can do separately:

wordcount = len(s.split())

If you want to do it all at once, you can use an alternation group:

word_and_emoticon_count = len(re.findall(ur'\w+|[\U0001f600-\U0001f650]', s))
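
For readers on Python 3, where every string literal is already Unicode and no `u` prefix is needed, the same idea can be sketched roughly as follows. The function name `count_words_and_emoticons` is made up for illustration, and the code point range is simply the one used above, not a complete emoji list. Unlike the alternation above, this sketch reports the two counts separately.

```python
import re

# Same code point range as used above; adjust it to the block you care about.
EMOTICON = re.compile('[\U0001f600-\U0001f650]')

def count_words_and_emoticons(text):
    """Return (word_count, emoticon_count) for a Unicode string."""
    emoticon_count = len(EMOTICON.findall(text))
    # Replace each emoticon with a space so that "rocks!<emoticon>" still counts
    # as one word and a free-standing emoticon is not counted as a word.
    words = EMOTICON.sub(' ', text).split()
    return len(words), emoticon_count

s = "Smiley emoticon rocks!\U0001f600 I like you.\U0001f601"
print(count_words_and_emoticons(s))  # (6, 2)
```

Splitting on whitespace keeps punctuation attached to words ("rocks!" is one word), which matches the `s.split()` approach shown earlier.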

As @strangefeatures points out, Python versions before 3.3 allowed "narrow Unicode" builds. And, for example, most CPython Windows builds are narrow. In narrow builds, characters can only be in the range U+0000 to U+FFFF. There's no way to search for these characters, but that's OK, because they don't exist to search for; you can just assume they don't exist if you get an "invalid range" error compiling the regexp.

Except, of course, that there's a good chance that wherever you're getting your actual strings from, they're UTF-16-BE or UTF-16-LE, so the characters do exist, they're just encoded into surrogate pairs. And you want to match those surrogate pairs, right? So you need to translate your search into a surrogate-pair search. That is, convert your high and low code points into surrogate pair code units, then (in Python terms) search for:

(lead == low_lead and lead != high_lead and low_trail <= trail <= DFFF or
 lead == high_lead and lead != low_lead and DC00 <= trail <= high_trail or
 low_lead < lead < high_lead and DC00 <= trail <= DFFF)

You can leave off the second condition in the last case if you're not worried about accepting bogus UTF-16.

If it's not obvious how that translates into regexp, here's an example for the range [\U0001e050-\U0001fbbf] in UTF-16-BE:

(\ud838[\udc50-\udfff])|([\ud839-\ud83d].)|(\ud83e[\udc00-\udfbf])

Of course if your range is small enough that low_lead == high_lead this gets simpler. For example, the original question's range can be searched with:

\ud83d[\ude00-\ude50]
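
The conversion from code points to surrogate pairs is mechanical, so it can be scripted. The following is only a sketch (written for Python 3, with hypothetical helper names `to_surrogates` and `surrogate_range_pattern`) that builds the regexp source text for a range using the three cases described above; it emits the strict form in which the middle case requires a valid trail unit.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 (lead, trail) code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def surrogate_range_pattern(low, high):
    """Build regexp source text matching the surrogate pairs for low..high."""
    low_lead, low_trail = to_surrogates(low)
    high_lead, high_trail = to_surrogates(high)
    unit = '\\u{:04x}'.format  # renders a code unit as a \uXXXX escape
    if low_lead == high_lead:
        # Simple case: one lead unit, one trail range.
        return '{}[{}-{}]'.format(unit(low_lead), unit(low_trail), unit(high_trail))
    parts = ['{}[{}-{}]'.format(unit(low_lead), unit(low_trail), unit(0xDFFF))]
    if high_lead - low_lead > 1:
        # Lead units strictly between the two ends accept any valid trail unit.
        parts.append('[{}-{}][{}-{}]'.format(
            unit(low_lead + 1), unit(high_lead - 1), unit(0xDC00), unit(0xDFFF)))
    parts.append('{}[{}-{}]'.format(unit(high_lead), unit(0xDC00), unit(high_trail)))
    return '|'.join('({})'.format(p) for p in parts)

print(surrogate_range_pattern(0x1F600, 0x1F650))  # \ud83d[\ude00-\ude50]
print(surrogate_range_pattern(0x1E050, 0x1FBBF))
```

The second call prints the three-way alternation for the wider example range, with `[\udc00-\udfff]` in place of `.` in the middle branch.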

One last trick, if you don't actually know whether you're going to get UTF-16-LE or UTF-16-BE (and the BOM is far away from the data you're searching): Because no surrogate lead or trail code unit is valid as a standalone character or as the other end of a pair, you can just search in both directions:

(\ud838[\udc50-\udfff])|([\ud839-\ud83d][\udc00-\udfff])|(\ud83e[\udc00-\udfbf])|
([\udc50-\udfff]\ud838)|([\udc00-\udfff][\ud839-\ud83d])|([\udc00-\udfbf]\ud83e)
abarnert
  • I had to change your r' to u' as re.findall(u'[\U0001f600-\U0001f650]', s.decode('utf-8')), which then correctly finds the emoticons. Thanks @abarnert! – blehman Oct 03 '13 at 16:21
  • @simplyclimb: Yeah, you need the `u'…'`—and the `s` variable should _also_ be a unicode string. (For some reason, I assumed you were using Python 3, but looking at your question, it's obviously 2.x.) But you still want the `r`. In this case, dropping it happens to not matter, because a Python string literal will interpret the escape sequence `\U0001f600` the exact same way as the `re` engine would… But it's a good idea to always use raw strings for regexps unless you have a specific reason not to, instead of studying each regexp to figure out whether you need a raw string or not. – abarnert Oct 03 '13 at 18:19
  • The re `ur'[\U0001f600-\U0001f650]'` fails to compile on some Python builds less than 3.3 (I think narrow builds - ie `sys.maxunicode == 0xffff`), with a "bad character range" error. – Andy MacKinlay Aug 21 '14 at 03:24
  • @strangefeatures: Yeah, IIRC that's considered a "wontfix" bug in the `re` library because the 3.3 Unicode changes made it irrelevant, and because there's no easy fix with UTF-16. I'll update the answer to explain, but no one will like the solution… – abarnert Aug 21 '14 at 21:48
  • @strangefeatures: (Also, I believe we were supposed to be getting `regex` to replace `re` in 2.7/3.2, which could have handled this for you, but then it got deferred to 3.3, then 3.4, and then indefinitely…) – abarnert Aug 21 '14 at 22:17
5

My solution includes the emoji and regex modules. The regex module supports recognizing grapheme clusters (sequences of Unicode codepoints rendered as a single character), so we can count emojis like 👨‍👩‍👦‍👦 once, although it consists of 4 emojis.

import emoji
import regex

def split_count(text):
    emoji_counter = 0
    data = regex.findall(r'\X', text)
    for word in data:
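        # emoji 2.x exposes EMOJI_DATA; older releases provided
        # emoji.UNICODE_EMOJI, which newer releases no longer ship
        if any(char in emoji.EMOJI_DATA for char in word):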
            emoji_counter += 1
            # Remove from the given text the emojis
            text = text.replace(word, '') 

    words_counter = len(text.split())

    return emoji_counter, words_counter

Testing:

line = "hello 👩🏾‍🎓 emoji hello 👨‍👩‍👦‍👦 how are 😊 you today🙅🏽🙅🏽"
counter = split_count(line)
print("Number of emojis - {}, number of words - {}".format(counter[0], counter[1]))

Output:

Number of emojis - 5, number of words - 7
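
If installing third-party modules is not an option, a rough approximation can be written with the standard `re` module alone (Python 3). This is only a sketch under loud assumptions: the code point ranges are a hand-picked subset rather than the full Unicode emoji list, and the pattern merely glues together skin-tone modifiers, the variation selector U+FE0F, and zero-width-joiner sequences instead of doing the real grapheme clustering that `\X` performs. The names `EMOJI_SEQUENCE` and `split_count_stdlib` are hypothetical.

```python
import re

# One emoji "unit": a base symbol, an optional skin-tone modifier, an optional U+FE0F.
# ASSUMPTION: these ranges are a simplification, not the complete emoji set.
_UNIT = '[\U0001F300-\U0001FAFF\u2600-\u27BF][\U0001F3FB-\U0001F3FF]?\uFE0F?'
# Units joined by ZERO WIDTH JOINER (U+200D) are treated as a single emoji.
EMOJI_SEQUENCE = re.compile('{0}(?:\u200D{0})*'.format(_UNIT))

def split_count_stdlib(text):
    """Return (emoji_count, word_count), counting each joined sequence once."""
    emoji_counter = len(EMOJI_SEQUENCE.findall(text))
    # Blank out the emoji so that words glued to them are still split correctly.
    words_counter = len(EMOJI_SEQUENCE.sub(' ', text).split())
    return emoji_counter, words_counter

# Two ZWJ sequences, one plain emoji, and two emoji with skin-tone modifiers.
line = ("hello \U0001F469\U0001F3FE\u200D\U0001F393 emoji hello "
        "\U0001F468\u200D\U0001F469\u200D\U0001F466\u200D\U0001F466 how are "
        "\U0001F60A you today\U0001F645\U0001F3FD\U0001F645\U0001F3FD")
print(split_count_stdlib(line))  # (5, 7)
```

Flags, keycaps, and symbols outside the two ranges are not handled here, which is the main reason to prefer the regex and emoji modules when they are available.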
sheldonzy
  • `UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal if any(char in emoji.UNICODE_EMOJI for char in word):` error – kingmakerking May 31 '18 at 13:48
  • Easier: `for word in data: if emoji.is_emoji(word): ...` – Emiel Nov 16 '21 at 14:56
  • Even better, you can collapse the whole for loop: `emoji_counter = sum(emoji.is_emoji(word) for word in data)`. If you really need to count words too, like OP, you subtract `emoji_counter` from `word_counter`, no need to do multiple replacements. – Robin De Schepper Apr 06 '22 at 20:14
0

If you are trying to read Unicode characters outside the ASCII range, don't convert into the ASCII range. Just leave it as Unicode and work from there (untested):

import sys

count = 0
emoticons = set(range(int('1f600',16), int('1f650', 16)))
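for row in sys.stdin:
    # In Python 2, sys.stdin yields byte strings; decode so that we
    # iterate over characters rather than individual bytes
    for char in row.decode('utf-8'):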
        if ord(char) in emoticons:
            count += 1
print "%d emoticons found" % count

Not the best solution, but it should work.
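
A Python 3 take on the same code point test might look like the sketch below. Python 3 reads text as Unicode by default, so no decoding step is shown; the function name `count_emoticons` is invented for illustration, and the half-open range mirrors the one above (U+1F600 up to, but not including, U+1F650).

```python
# Half-open range of code points, as in the snippet above.
EMOTICONS = range(0x1F600, 0x1F650)

def count_emoticons(lines):
    """Count characters whose code point falls inside EMOTICONS."""
    return sum(1 for line in lines for char in line if ord(char) in EMOTICONS)

sample = ["Smiley emoticon rocks!\U0001f600\n", "I like you\U0001f601\n"]
print("%d emoticons found" % count_emoticons(sample))  # 2 emoticons found
```

Any iterable of strings works as input, so passing `sys.stdin` instead of `sample` would cover the piped command-line use from the question.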

Ethan Furman
-2

This is my solution using re:

import re
text = "your text with emojis"
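# Counts every character that is not a word character, whitespace, comma, or
# period, so punctuation such as "!" or "?" is counted along with emojis
em_count = len(re.findall(r'[^\w\s,.]', text))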
print(em_count)
AKMalkadi