Counting the number of emoticons in tweets extracted to a file

Question

I have extracted a number of tweets (UTF-8) into a CSV file. I am trying to run Python code to count the number of emoticons in each tweet. The emoticons appear in the file as follows: ðŸ’©ðŸ’©ðŸ’©ðŸ’©ðŸ’©

Now I don't know how to identify these. I tried to convert the whole string to Unicode and then count them with the following code:

    s = str(strs, "unicode")
    print(s)
    print(strs)
    emoti = re.finditer(r'[\U0001f600-\U0001f650]', s)
    count = sum(1 for _ in emoti)

but it gives the error "decoding str is not supported". I can't collect all the tweets again; I need to count the number of emoticons in the same set of tweets. Can anybody tell me how to go about it? Thanks in advance.

Read the answers here http://stackoverflow.com/questions/43146528/how-to-extract-all-the-emojis-from-text/43147265#43147265 — Mazdak, Apr 01 '17 at 09:52

score 0 · Answer 1 · answered Apr 01 '17 at 16:48

If this string is what you have:

'ðŸ’©ðŸ’©ðŸ’©ðŸ’©ðŸ’©'

It has been decoded with the wrong codec. It looks like cp1252 (the Windows ANSI default). Re-encode it with the incorrect codec that was used, then decode it as UTF-8. Better yet, fix the source of the incorrect decoding.

>>> 'ðŸ’©ðŸ’©ðŸ’©ðŸ’©ðŸ’©'.encode('cp1252')
b'\xf0\x9f\x92\xa9\xf0\x9f\x92\xa9\xf0\x9f\x92\xa9\xf0\x9f\x92\xa9\xf0\x9f\x92\xa9'
>>> 'ðŸ’©ðŸ’©ðŸ’©ðŸ’©ðŸ’©'.encode('cp1252').decode('utf8')
'💩💩💩💩💩'
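
As a reusable version of the re-encode/decode round trip, here is a minimal sketch. `fix_mojibake` is a hypothetical helper name, not a library function. It assumes the text really is UTF-8 that was decoded as cp1252, and it returns the input unchanged if the round trip fails. The per-character fallback to latin-1 is there because Python's cp1252 codec leaves five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, while Windows-style decoding typically passes them through as control characters, and a number of emoji contain those bytes.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as cp1252 (hypothetical helper)."""
    raw = bytearray()
    try:
        for ch in text:
            try:
                raw += ch.encode("cp1252")
            except UnicodeEncodeError:
                # Assumption: the character came from one of the bytes cp1252
                # leaves undefined, so map it straight back to its byte value.
                raw += ch.encode("latin-1")
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not cp1252 mojibake after all: leave the text as it was.
        return text


# The string from the question, written with escapes for clarity.
repaired = fix_mojibake("\u00f0\u0178\u2019\u00a9" * 5)
```

If you still have the CSV file itself, opening it with `open(path, encoding="utf-8")` should avoid the wrong decoding in the first place, which is probably the cleaner fix.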

Unfortunately there is not a single range of Unicode characters for emoji. See emoji-test.txt from the unicode.org website. That particular character is U+1F4A9, and is outside the Unicode range you have specified in your sample code.
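
Since no single range covers every emoji, one rough approach is to match a handful of blocks. The sketch below is an approximation under stated assumptions: the ranges in `EMOJI_RE` are my own selection rather than an official list (the second range also contains some symbols that are not emoji), it counts code points rather than visible glyphs, so a flag or a joined sequence counts more than once, and it expects text that has already been decoded correctly. `count_emoji` is a hypothetical name.

```python
import re

# Assumption: an approximate set of blocks, not the full list in emoji-test.txt.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flag halves)
    "]"
)


def count_emoji(text):
    """Count code points in `text` that fall inside the ranges above."""
    return len(EMOJI_RE.findall(text))


pile_count = count_emoji("\U0001F4A9" * 5)  # the five U+1F4A9 characters from the question
```

For exact counts you would probably want to build the pattern from emoji-test.txt instead of hand-picked ranges.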
