I downloaded data from Facebook that looks like this:

data = [
    {
        "content": "Hi",
        "sender_name": "J\u00c3\u00a9r\u00c3\u00a9my",
        "timestamp_ms": 1575674161100,
        "type": "Generic"
    },
    {
        "content": "Yeah",
        "sender_name": "Christo",
        "timestamp_ms": 1575674143398,
        "type": "Generic"
    },
    {
        "content": "Hello",
        "sender_name": "William",
        "timestamp_ms": 1575674130441,
        "type": "Generic"
    },
    {
        "content": "Bruh",
        "sender_name": "William",
        "timestamp_ms": 1575674121964,
        "type": "Generic"
    }
]

My goal is to generate a JSON file containing all messages, but without the Unicode escapes. For example, I'd like J\u00c3\u00a9r\u00c3\u00a9my to show as Jérémy. I've tried several things, like reading the file line by line:

with open(src_filename, 'r') as src_file:
    with open(dst_filename, 'w') as dst_file:
        for line in src_file:
            dst_file.write(line.encode('latin_1').decode('utf-8'))

It works in the terminal.

u1 = "J\u00c3\u00a9r\u00c3\u00a9my"
print(u1.encode('latin1').decode('utf-8'))

It shows Jérémy in the terminal, but not in my file.

I also tried json.dumps():

with open("filename", "w") as json_file:
    json_string = json.dumps(data, ensure_ascii=False).encode('utf8').decode('utf8')
    json.dump(json_string, json_file, ensure_ascii=False)

but it doesn't recognize some characters: UnicodeEncodeError: 'charmap' codec can't encode character '\x83' in position 276: character maps to <undefined> (Note that my actual data contains many more messages, mostly in French.)

How can I write my data to a JSON file while showing special French characters such as "é", "à", "è" or other non-ascii characters like "%"?

1 Answer


I played around with it for a while until I got it:

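import json

with open('filename.json', 'w', encoding='utf-8') as fp:
    # Serialize without \u escapes, then reinterpret the Latin-1 code points as UTF-8 bytes
    json_string = json.dumps(data, ensure_ascii=False).encode('latin1').decode('utf8')
    fp.write(json_string)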

Your problem came from calling json.dump a second time; json_string is already serialized JSON and can be written to the file directly.
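
To illustrate the double-serialization point, here is a small sketch. The one-item `data` list is a trimmed copy of the question's first message, and the variable name `double_encoded` is only for illustration. Serializing the already-serialized string wraps the whole document in quotes and escapes the inner quotes, so the file ends up holding one big JSON string instead of a list:

```python
import json

# Mojibake as it appears once the Facebook export has been loaded
data = [{"sender_name": "J\u00c3\u00a9r\u00c3\u00a9my"}]

# First serialization: a JSON document describing a list
json_string = json.dumps(data, ensure_ascii=False)

# Second serialization: the JSON text is treated as a plain string value,
# so it gets quoted and its inner quotes are backslash-escaped
double_encoded = json.dumps(json_string, ensure_ascii=False)

print(json_string)
print(double_encoded)
```

Loading `double_encoded` back gives a `str`, not a list, which is why writing `json_string` directly is enough.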

Nimrod Morag
  • Thanks for your response, you led me to the right answer. Because the content is already a string, json.dumps() changes "\u00c3" to "\\u00c3" and I can't decode it. I iterated over all my messages and changed the encoding of the value itself:
    ```
    for msg in json_data:
        if 'sender_name' in msg.keys():
            msg['sender_name'] = msg['sender_name'].encode('latin1').decode('utf-8', errors="replace")
    ```
    – Jérémy Talbot-Pâquet Dec 14 '19 at 23:05
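
The per-value approach from the comment can be generalized so that every string in the export is repaired, not just `sender_name`. The following is a minimal sketch, not the original poster's code: `fix_mojibake` and `convert` are hypothetical helper names, and it assumes the export is a UTF-8 JSON file whose strings were mis-decoded as Latin-1. Strings that do not round-trip cleanly are left untouched instead of being replaced.

```python
import json


def fix_mojibake(value):
    """Recursively repair strings whose UTF-8 bytes were decoded as Latin-1."""
    if isinstance(value, str):
        try:
            return value.encode("latin1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return value  # not mojibake (or not repairable): keep as-is
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value  # numbers, booleans, None


def convert(src_filename, dst_filename):
    """Load the export, repair every string, and write readable UTF-8 JSON."""
    with open(src_filename, "r", encoding="utf-8") as src_file:
        data = json.load(src_file)
    # An explicit encoding avoids the platform default (e.g. cp1252 on Windows),
    # which is what raises the 'charmap' UnicodeEncodeError
    with open(dst_filename, "w", encoding="utf-8") as dst_file:
        json.dump(fix_mojibake(data), dst_file, ensure_ascii=False, indent=4)
```

Calling `convert(src_filename, dst_filename)` with the names from the question would then write the messages with "Jérémy" spelled out, since `ensure_ascii=False` keeps non-ASCII characters unescaped.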