How can I convert JSON-encoded data that contains Unicode surrogate pairs to string?

Question

I am trying to take data that contains Unicode escape sequences and print it with the emojis rendered. The data is currently in a .txt file, but I will write it to an Excel file later. I am getting an error that I am not sure how to handle. This is the text I am reading:

"Thanks @UglyGod \ud83d\ude4f https:\\/\\/t.co\\/8zVVNtv1o6\"
"RT @Rosssen: Multiculti beatdown \ud83d\ude4f https:\\/\\/t.co\\/fhwVkjhFFC\"

And here is my code:

sampleFile= open('tweets.txt', 'r').read()
splitFile=sampleFile.split('\n')
for line in sampleFile:
    x=line.encode('utf-8')
    print(x.decode('unicode-escape'))

This is the error message:

UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 0: \ at end of string

Any ideas? This is how the data was originally generated:

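import json
import time

from tweepy.streaming import StreamListener  # tweepy 3.x; removed in tweepy 4

class listener(StreamListener):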

    def on_data(self, data):
        # Check for a field unique to tweets (if missing, return immediately)
        if "in_reply_to_status_id" not in data:
            return
        with open("see_no_evil_monkey.csv", 'a') as saveFile:
            try:
                saveFile.write(json.dumps(data) + "\n")
            except BaseException as e:
                print ("failed on data", str(e))
                time.sleep(5)
        return True

    def on_error(self, status):
        print (status)

You are trying to decode a `bytes` object with 'unicode-escape' that was previously encoded with 'utf8'; 'unicode-escape' cannot read strings encoded with 'utf8'. I believe the simplest solution to your problem would be to pass the correct encoding to the `open` function when reading from the file. – Dean Fenster, Jun 29 '16 at 18:06
So this is the code that was used to generate the original data from twitter: — Patrick Reid, Jun 30 '16 at 18:28
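
A possible explanation for the `\\/` and `\"` sequences in the sample text: if `data` reaches `on_data` as a raw JSON string (an assumption; the question does not say so explicitly), then `json.dumps(data)` wraps that string in a second layer of JSON escaping. The sketch below uses a made-up payload named `raw` to show the effect, and that reading such a line back would then take two `json.loads()` calls.

```python
import json

# Hypothetical raw payload: JSON text as an API might deliver it
# (ASCII only, emoji escaped as a surrogate pair, slashes escaped).
raw = '{"text": "Thanks \\ud83d\\ude4f https:\\/\\/t.co\\/8zVVNtv1o6"}'

# What the listener writes: the JSON text encoded again as a JSON string.
line = json.dumps(raw) + "\n"
print(line)  # backslashes are now doubled and quotes are escaped

inner = json.loads(line)   # first pass: back to the original JSON text (str)
tweet = json.loads(inner)  # second pass: the actual object (dict)

print(type(inner).__name__, type(tweet).__name__)
```

If `data` is already a parsed object rather than a string, a single `json.loads()` per line is enough.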

jfs · Answer 1 · 2016-07-01T12:40:48.157

4

This is how the data was originally generated... saveFile.write(json.dumps(data) + "\n")

You should use json.loads() instead of .decode('unicode-escape') to read JSON text:

#!/usr/bin/env python3
import json

with open('tweets.txt', encoding='ascii') as file:
    for line in file:
        text = json.loads(line)
        print(text)
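
To see what `json.loads()` does with the escapes, here is a minimal self-contained sketch that skips the file and parses two in-memory lines instead. The sample strings are illustrative, shaped like the output of `json.dumps()` on plain text.

```python
import json

# Lines shaped like json.dumps() output: pure ASCII, with the emoji
# written as the escaped surrogate pair \ud83d\ude4f.
lines = [
    '"Thanks @UglyGod \\ud83d\\ude4f"\n',
    '"Multiculti beatdown \\ud83d\\ude4f"\n',
]

texts = [json.loads(line) for line in lines]

for text in texts:
    # The last character is a single code point, not two surrogates.
    print(hex(ord(text[-1])))
```

The JSON parser combines each escaped pair into one code point, so no separate decoding step is needed afterwards.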

edited Jul 01 '16 at 12:40

answered Jul 01 '16 at 12:34

jfs


OK, so your method is working for me now when I just write the emoji to a text file. The full contents are `"\ud83d\ude4f"` plus a newline. I guess `json` is handling the surrogate pairs under the hood. Question is, if I have a surrogate pair in a regular string (Py3 unicode string) represented as `"\ud83d\ude4f"` like the OP, how do I process that to print the emoji? Everything I tried gave me errors about surrogate pairs. – MattDMo Jul 01 '16 at 13:14
1

@MattDMo OP uses `json.dumps()` and therefore there are no non-ASCII characters in the file at all (do you see `encoding="ascii"` in my answer?). The case you are describing has nothing to do with parsing the result of `json.dumps()` as OP needs. If you have a different question then ask it as a separate Stack Overflow question. – jfs Jul 01 '16 at 13:18
OK, I'll ask a new question. – MattDMo Jul 01 '16 at 13:19
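
For the separate case raised in these comments (a Python 3 `str` that already holds two lone surrogate code points), one commonly used approach is a round trip through UTF-16 with the `surrogatepass` error handler. This is not part of jfs's answer; it is only a sketch of that technique.

```python
# A str holding two separate surrogate code points (not one emoji).
s = "\ud83d\ude4f"

# 'surrogatepass' lets the UTF-16 encoder emit the surrogates unchanged;
# decoding those bytes as UTF-16 then reads them as one valid pair.
fixed = s.encode("utf-16", "surrogatepass").decode("utf-16")

print(len(s), len(fixed))  # 2 1
print(hex(ord(fixed)))     # 0x1f64f
```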

score 3 · Answer 2 · answered Jun 29 '16 at 18:14


Your emoji is represented as a surrogate pair, see also here for info about this particular glyph. Python cannot decode surrogates, so you'll need to look at exactly how your tweets.txt file was generated, and try encoding the original tweets, along with the emoji, as UTF-8. This will make reading and processing the text file much easier.

answered Jun 29 '16 at 18:14

MattDMo


1

Python can decode surrogates just fine. It is how JSON can represent non-BMP characters. – jfs Jul 01 '16 at 12:39
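
A short sketch of the point made in this comment: with its default `ensure_ascii=True`, `json.dumps()` writes a non-BMP character as an escaped surrogate pair, and `json.loads()` turns that pair back into the original character.

```python
import json

emoji = "\U0001F64F"  # one code point outside the BMP

# Default settings: ASCII-only output, emoji as an escaped surrogate pair.
escaped = json.dumps(emoji)
print(escaped)  # "\ud83d\ude4f"

# Round trip: the parser recombines the pair into the same character.
restored = json.loads(escaped)
print(restored == emoji, len(restored))

# With ensure_ascii=False the character is written as-is instead.
unescaped = json.dumps(emoji, ensure_ascii=False)
print(len(unescaped))  # the character plus two quote marks
```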
