I have extracted a number of tweets (UTF-8) into a CSV file. I am trying to run Python code to count the number of emoticons in each tweet. The emoticons appear in the file as follows: ðŸ’©ðŸ’©ðŸ’©ðŸ’©ðŸ’©


Now I don't know how to identify these. I tried to convert the whole string to Unicode and then count the emoticons with the following code:

s = str(strs, "unicode")
print(s)
print(strs)
emoti = re.finditer(r'[\U0001f600-\U0001f650]', s)
count = sum(1 for _ in emoti)


but it gives the error "decoding str is not supported". I can't collect all the tweets again; I need to count the number of emoticons in the same set of tweets. Can anybody tell me how to go about it? Thanks in advance.

Arushi Seth

1 Answer

If this string is what you have:

'ðŸ’©ðŸ’©ðŸ’©ðŸ’©ðŸ’©'

It has been decoded with the wrong codec. It looks like cp1252 (the Windows ANSI default). Re-encode it with the incorrect codec that was used, then decode it with utf8. Better yet, fix the source of the incorrect decoding.

>>> 'ðŸ’©ðŸ’©ðŸ’©ðŸ’©ðŸ’©'.encode('cp1252')
b'\xf0\x9f\x92\xa9\xf0\x9f\x92\xa9\xf0\x9f\x92\xa9\xf0\x9f\x92\xa9\xf0\x9f\x92\xa9'
>>> 'ðŸ’©ðŸ’©ðŸ’©ðŸ’©ðŸ’©'.encode('cp1252').decode('utf8')
'💩💩💩💩💩'
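
One way to fix the source of the incorrect decoding is to name the encoding explicitly when opening the CSV file, instead of relying on the platform default. This is only a sketch: the function name read_rows and the assumption that the file on disk really is UTF-8 are mine, not something the question confirms.

```python
import csv


def read_rows(path, encoding="utf-8"):
    """Read a CSV file with an explicit encoding (hypothetical helper).

    Assumes the file on disk is valid UTF-8. Without the encoding
    argument, open() uses the locale default, which on Windows is
    often cp1252 and can produce the mojibake shown above.
    """
    # newline="" is what the csv module documentation recommends
    # when passing a file object to csv.reader.
    with open(path, encoding=encoding, newline="") as f:
        return list(csv.reader(f))
```

If the file was saved with a byte order mark, passing encoding="utf-8-sig" may be the better choice, since it strips the BOM from the first field.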

Unfortunately there is not a single range of Unicode characters for emoji. See emoji-test.txt from the unicode.org website. That particular character is U+1F4A9, and is outside the Unicode range you have specified in your sample code.
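
Since no single range covers emoji, a rough count can combine several Unicode blocks into one character class. The sketch below is an approximation under my own assumptions: the blocks listed are a hand-picked, non-exhaustive selection (not the full list from emoji-test.txt), and EMOJI_PATTERN and count_emoji are hypothetical names.

```python
import re

# Approximate, non-exhaustive selection of blocks that contain emoji.
# This is an assumption for illustration, not the complete emoji list.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols and Pictographs (includes U+1F4A9)
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F680-\U0001F6FF"  # Transport and Map Symbols
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\u2600-\u26FF"          # Miscellaneous Symbols
    "\u2700-\u27BF"          # Dingbats
    "]"
)


def count_emoji(text):
    """Count code points in text that fall inside the blocks above."""
    return len(EMOJI_PATTERN.findall(text))
```

Note that this counts individual code points, so an emoji built from several code points (for example a sequence joined with zero-width joiners, or one followed by a skin-tone modifier) is counted more than once. The text must already be correctly decoded; mojibake has to be repaired before counting.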

Mark Tolonen