1

I have a .json file with over 70,000 tweets, with each tweet containing emojis. However, I am unsure how to convert the Unicode into the actual emojis, so that it can be used for sentiment analysis.

This is a sample of 5 tweets in my .json file:

{"text":"The morning is going so fast Part 2 of #DiscoveryDay is in full swing \ud83d\ude01\n\nGreat Atmosphere in the room \n\n#BIGSocial\u2026 https:\/\/t.co\/P08qBoH6tv"}
{"text":"Double kill! #XiuKai lives! I died. \ud83d\ude0c https:\/\/t.co\/QCyk3r2JCb"}
{"text":"ALLTY \ud83d\udc94"}
{"text":"Shouldn\u2019t be normal for a 24 year old to be this tiered \ud83d\udca4"}
{"text":"@TheNames_BrieX Trust me! \ud83d\udcaf"}

Now, how would I convert the Unicode escapes in all the tweets into the actual emojis? For instance, how would \ud83d\ude0c be converted into the actual emoji?

What methods can be used to convert the Unicode escapes into the actual emojis?

Joachim Sauer
  • I think your problem stems mostly from the fact that you are unclear what the "actual emoji" is. A proper JSON parser will convert the `\u` escapes into the appropriate Unicode characters, which **are** the emoji for all intents and purposes, so other than the normal JSON processing, you should require no additional steps. You **are** using a [real JSON parser](https://docs.python.org/3/library/json.html) and don't just treat it as plain text, right? – Joachim Sauer May 28 '21 at 18:16
  • The snippet you showed isn't JSON, but [*JSON lines*](http://jsonlines.org/). You can't parse all of it with `json.load()` at once, like you would for a regular JSON file. For processing JSON lines with Python, you read the file as text, line by line, and pass each line to `json.loads()`. As Joachim Sauer explained, this will correctly process the `\u` escapes. – lenz May 28 '21 at 18:22
  • @tripleee I think the duplicate candidate you proposed is about a more complicated problem than this here. – lenz May 28 '21 at 18:37
  • @lenz It's possible, of course, but it conveniently works for all other Unicode JSON as well, and the OP's sample does contain surrogates. – tripleee May 28 '21 at 18:41
  • @tripleee That's because JSON uses surrogates always (unless literal characters are used of course). If you properly handle JSON (with a JSON library), you shouldn't have to bother what surrogates even are. – lenz May 28 '21 at 18:43
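
The point made in these comments can be checked directly. The following is a minimal sketch, assuming one line of the question's sample is held in a Python raw string (the variable names are illustrative only):

```python
import json

# One line from the question's sample, exactly as it appears in the file.
raw_line = r'{"text":"Double kill! #XiuKai lives! I died. \ud83d\ude0c https:\/\/t.co\/QCyk3r2JCb"}'

tweet = json.loads(raw_line)
text = tweet["text"]

# The second-to-last whitespace-separated token is the emoji.
emoji = text.split()[-2]

# The surrogate pair \ud83d\ude0c has been merged into one code point.
print(emoji, len(emoji), hex(ord(emoji)))
```

After parsing, the emoji is a single character with code point U+1F60C, so no further conversion step should be needed.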

3 Answers

3

If this is your actual JSON file content:

{"text":"The morning is going so fast Part 2 of #DiscoveryDay is in full swing \ud83d\ude01\n\nGreat Atmosphere in the room \n\n#BIGSocial\u2026 https:\/\/xxx\/P08qBoH6tv"}
{"text":"Double kill! #XiuKai lives! I died. \ud83d\ude0c https:\/\/xxx\/QCyk3r2JCb"}
{"text":"ALLTY \ud83d\udc94"}
{"text":"Shouldn\u2019t be normal for a 24 year old to be this tiered \ud83d\udca4"}
{"text":"@TheNames_BrieX Trust me! \ud83d\udcaf"}

Then that is JSON Lines format, where each line is a complete JSON structure, and not a single valid JSON file.

Read it a line at a time like so:

import json

# JSON Lines: parse each line as its own JSON document.
with open('test.json', encoding='utf-8') as f:
    for line in f:
        print(json.loads(line))

Output:

{'text': 'The morning is going so fast Part 2 of #DiscoveryDay is in full swing 😁\n\nGreat Atmosphere in the room \n\n#BIGSocial… https://xxx/P08qBoH6tv'}
{'text': 'Double kill! #XiuKai lives! I died. 😌 https://xxx/QCyk3r2JCb'}
{'text': 'ALLTY 💔'}
{'text': 'Shouldn’t be normal for a 24 year old to be this tiered 💤'}
{'text': '@TheNames_BrieX Trust me! 💯'}

Note that I had to change the shortened URLs from the original, since SO disallows content containing them.

If, as you say, that was only a sample of the lines, and the actual file is a single, well-formed JSON document, then just read it with json.load:

import json

# A single JSON document can be parsed in one call.
with open('test.json', encoding='utf-8') as f:
    print(json.load(f))
Mark Tolonen
  • Hello. Thanks for the response, and sorry for the late reply - and thank you for providing me with a solution. Now that the output shows each emoji in the tweet, how would the data need to be prepared for sentiment analysis? For example, would `{'text':` need to be removed? –  May 31 '21 at 10:28
  • @AnandP2812 Accessing the text value depends on which solution above worked for you – Mark Tolonen May 31 '21 at 16:18
  • @AnandP2812 Assign the line to a variable, e.g. `data = json.loads(line)`, then `print(data['text'])` – Mark Tolonen Jun 01 '21 at 03:55
  • Thanks, that works. However, just one thing: When executing that code, it only returns 1 tweet - not all of them. Any suggestions on how to output all the tweets? Sorry for asking too many questions. –  Jun 01 '21 at 11:15
  • @AnandP2812 put the print inside the loop – Mark Tolonen Jun 01 '21 at 13:32
  • Cool, thanks for your help throughout this question. You have really helped me, so a big thanks. –  Jun 01 '21 at 19:13
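
The advice in these comments (assign the parsed line to a variable, read `data['text']`, and do so inside the loop) could be sketched as follows. `load_tweet_texts` is a hypothetical helper name, and the sketch assumes every non-empty line is a JSON object with a `text` key:

```python
import json

def load_tweet_texts(path):
    """Return the "text" value of every non-empty line in a JSON Lines file."""
    texts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            data = json.loads(line)
            # Collect inside the loop so that every tweet is kept,
            # not only the last one.
            texts.append(data["text"])
    return texts
```

Calling `load_tweet_texts('test.json')` would then give a plain list of strings, with the emoji already decoded, which is usually a convenient starting point for further preprocessing.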
-1

Emoji are a subset of Unicode, so no conversion from Unicode to emoji is necessary or possible. Just change your array to

var data = ["\u{1F642}", "\u{1F603}"]

If your input is a hex number, you can use

String.fromCodePoint(parseInt("1F929", 16))
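
Since the other answers on this page use Python, a rough Python counterpart of the call above might look like this. `emoji_from_hex` is a hypothetical helper name, not part of any library:

```python
def emoji_from_hex(hex_code):
    """Convert a hex code point string such as "1F929" to its character."""
    return chr(int(hex_code, 16))

print(emoji_from_hex("1F929"))
```

Here `int(hex_code, 16)` plays the role of `parseInt`, and the built-in `chr` plays the role of `String.fromCodePoint`.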

In HTML you can also use HTML hex entities

"&#x" + "1F618" + ";"
-2

Strings like \ud83d\udcaf are caused by incorrect handling, and can be fixed with data['text'].encode('utf-16', 'surrogatepass').decode('utf-16').
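
A minimal sketch of when this encode/decode round trip has an effect: it applies to a Python string that already holds two separate surrogate code points, which is an assumption about how the text was loaded. A string produced by `json.loads` is already merged and comes out unchanged.

```python
# A str holding two lone surrogates (length 2), here built from a Python
# literal. This starting point is assumed for illustration.
broken = "\ud83d\udcaf"

# surrogatepass lets the encoder write the surrogates as UTF-16 code units;
# decoding then reads them back as one valid pair.
fixed = broken.encode("utf-16", "surrogatepass").decode("utf-16")

print(len(broken), len(fixed))  # 2 1
print(hex(ord(fixed)))          # 0x1f4af
```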

If you are doing rule-based sentiment analysis, the code above will display the actual emoji in your terminal, and you can build a label mapping for them; there is no need to convert the original text.

If you are doing sentiment analysis with a statistical or deep learning model, the model can capture semantic information through statistical features or supervised learning, and these emoji tokens may be identified as important features automatically.
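
For the rule-based case, a label mapping could be sketched roughly as below. The labels are hypothetical choices made for illustration, not a validated lexicon, and `emoji_labels` is an illustrative helper name:

```python
# Hypothetical labels chosen for illustration only; not a validated lexicon.
EMOJI_SENTIMENT = {
    "\U0001F601": "positive",  # 😁
    "\U0001F60C": "positive",  # 😌
    "\U0001F494": "negative",  # 💔
    "\U0001F4A4": "neutral",   # 💤
    "\U0001F4AF": "positive",  # 💯
}

def emoji_labels(text):
    """Return the labels of all mapped emoji found in text, in order."""
    return [EMOJI_SENTIMENT[ch] for ch in text if ch in EMOJI_SENTIMENT]

print(emoji_labels("ALLTY \U0001F494"))  # ['negative']
```

This sketch looks at one code point at a time, so emoji that consist of several code points (for example skin-tone or ZWJ sequences) would need a different lookup strategy.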

lsv
  • `\ud83d\udcaf` is not caused by incorrect handling at all. This is how non-ASCII characters may be encoded in JSON. – lenz May 28 '21 at 18:32
  • The correct encoded '\ud83d\udcaf' should be '\U0001f4af' – lsv May 28 '21 at 18:37
  • `"\U0001f4af"` is how you can escape the character in a Python string literal. `\ud83d\udcaf` is JSON. – lenz May 28 '21 at 18:39
  • Try this: `json.dumps('\U0001f4af')`. You will see double backslashes if you try it in a REPL, but if you write it to a file, there will be `"\ud83d\udcaf"`. – lenz May 28 '21 at 18:41
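
The experiment suggested in the last comment can be written out as a short sketch. It assumes only the standard `json` module; the variable names are illustrative:

```python
import json

emoji = "\U0001f4af"

# Default ensure_ascii=True: non-ASCII characters are written as \u escapes,
# and a character outside the BMP becomes a surrogate pair.
escaped = json.dumps(emoji)
print(escaped)  # "\ud83d\udcaf"

# Parsing the escaped form gives back the original single character.
assert json.loads(escaped) == emoji

# ensure_ascii=False keeps the character itself in the output instead.
literal = json.dumps(emoji, ensure_ascii=False)
```

Both forms are valid JSON for the same string, which is why a JSON parser returns the same emoji for either one.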