I would like to use the collections.Counter class to count emojis in a string. It generally works fine. However, when I introduce colored emojis, the color component is separated from the emoji, like so:

>>> import collections
>>> emoji_string = "👌🏻👌🏼👌🏽👌🏾👌🏿"
>>> emoji_counter = collections.Counter(emoji_string)
>>> emoji_counter.most_common()
[('👌', 5), ('🏻', 1), ('🏼', 1), ('🏽', 1), ('🏾', 1), ('🏿', 1)]

How can I make the most_common() function return something like this instead:

[('👌🏻', 1), ('👌🏼', 1), ('👌🏽', 1), ('👌🏾', 1), ('👌🏿', 1)]

I'm using Python 3.6

Toni Sučić

2 Answers

You'll have to split your string into separate clusters. Each of your emojis is really two code points, the base emoji followed by an EMOJI MODIFIER FITZPATRICK TYPE-X code point:

>>> print(emoji_string[0])
👌
>>> print(emoji_string[1])
🏻
>>> print(emoji_string[:2])
👌🏻
>>> print(ascii(emoji_string[:2]))
'\U0001f44c\U0001f3fb'
>>> import unicodedata
>>> unicodedata.name(emoji_string[1])
'EMOJI MODIFIER FITZPATRICK TYPE-1-2'

You could use a regular expression to keep each modifier together with the preceding emoji:

import re

char_with_modifier = re.compile(r'(.[\U0001f3fb-\U0001f3ff]?)')
split_emoji = char_with_modifier.findall(emoji_string)

and count the result.

Demo:

>>> import re
>>> from collections import Counter
>>> emoji_string = "👌🏻👌🏼👌🏽👌🏾👌🏿"
>>> char_with_modifier = re.compile(r'(.[\U0001f3fb-\U0001f3ff]?)')
>>> Counter(char_with_modifier.findall(emoji_string))
Counter({'👌🏻': 1, '👌🏼': 1, '👌🏽': 1, '👌🏾': 1, '👌🏿': 1})
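
Because the modifier is optional in that pattern, the same approach should also cope with strings that mix plain and skin-toned emojis. Here is a minimal sketch wrapping it in a function; the name `count_clusters` and the sample string are assumptions for illustration, and the pattern only knows about Fitzpatrick modifiers, not other combining sequences.

```python
import re
from collections import Counter

# Any single character, optionally followed by one Fitzpatrick modifier.
CHAR_WITH_MODIFIER = re.compile(r'.[\U0001f3fb-\U0001f3ff]?')


def count_clusters(text):
    """Count characters, keeping a skin-tone modifier with its base emoji."""
    return Counter(CHAR_WITH_MODIFIER.findall(text))


# Hypothetical sample: two light-toned, one dark-toned, one plain OK HAND SIGN.
sample = (
    "\U0001f44c\U0001f3fb"
    "\U0001f44c\U0001f3fb"
    "\U0001f44c\U0001f3ff"
    "\U0001f44c"
)
counts = count_clusters(sample)
print(counts.most_common())
```

The plain emoji and each toned variant end up under separate keys, so identical variants are tallied together.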
Martijn Pieters
  • Thanks, this helped a lot. I tweaked the regex a bit because I'd like it to find regular emojis as well, not just the colored ones: (.(?:[\U0001f3fb-\U0001f3ff])?) – Toni Sučić May 08 '17 at 18:49
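import regex  # third-party "regex" package, not the standard-library re
from collections import Counter

emoji_string = "👌🏻👌🏼👌🏽👌🏾👌🏿"
# \X matches one extended grapheme cluster, so each emoji keeps its modifier
data = regex.findall(r'\X', emoji_string)
print(Counter(data))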

Expected output:

Counter({'👌🏻': 1, '👌🏼': 1, '👌🏽': 1, '👌🏾': 1, '👌🏿': 1})
Abu Shoeb