
I have downloaded my Facebook data as json files. The json files for my posts contain emojis, which appear something like this in the json file: \u00f0\u009f\u0098\u008a. I want to parse this json file and extract the posts with the correct emojis.

I can't find a way to load this json file into a json object (using JavaScript) then read (and output) the post with the correct emojis.

(Eventually I will upload these posts to WordPress using its REST API, which I've worked out how to do.)

My program is written in JavaScript and run using nodejs from the command line. I've parsed the file using:

const fs = require('fs')
let filetext = fs.readFileSync(filename, 'utf8')
let jsonObj = JSON.parse(filetext)

However, when I output the data (using something like jsonObj.status_updates.data[0].post), I get strange characters for the emoji, like Happy birthday ð followed by unprintable characters, instead of Happy birthday 😊. This is not a Windows 10 console display issue, because I've also piped the output to a file.

I've used the answer to "Decode or unescape \u00f0\u009f\u0091\u008d to 👍" to change the \uXXXX sequences in the json file to actual emojis before parsing the file. However, JSON.parse then fails with this message:

SyntaxError: Unexpected token o in JSON at position 1
    at JSON.parse (<anonymous>)

So I'm in a bind: if I convert the \uXXXX sequences before trying to parse the json file, the JavaScript json parser has an error. If I don't convert the \uXXXX sequences then the parsed file in the form of a json object does not provide the correct emojis!

How can I correctly extract data, including emojis, from the json file?

Praful

1 Answer


I believe you should be able to do all this in Node.js. Here's an example, which I've tested using Visual Studio Code.

You can try it here: https://repl.it/repls/BrownAromaticGnudebugger

Note: I've updated processMessage as per @JakubASuplicki's very helpful comments to only look at string properties.

index.js

const fs = require('fs')
let filename = "test.json";
let filetext = fs.readFileSync(filename, "utf8");
let jsonObj = JSON.parse(filetext);

console.log(jsonObj);

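function decodeFBString(str) {
    // Each character of the parsed string holds one UTF-8 byte (0x00-0xFF),
    // so collect the code units as bytes and decode the bytes as UTF-8.
    const arr = [];
    for (let i = 0; i < str.length; i++) {
        arr.push(str.charCodeAt(i));
    }
    return Buffer.from(arr).toString("utf8");
}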

function processMessages (messageArray) {
    return messageArray.map(processMessage);
}

function processMessage(message) {
    // Decode string properties only; other values (e.g. timestamp_ms) pass through unchanged.
    return Object.keys(message).reduce((obj, key) => {
        obj[key] = (typeof message[key] === "string") ? decodeFBString(message[key]) : message[key];
        return obj;
    }, {});
}

let messages = processMessages(jsonObj.messages);
console.log("Input: ", jsonObj.messages);
console.log("Output: ", messages);

test.json

{
    "participants": [
        {
            "name": "Philip Marlowe"
        },
        {
            "name": "Terry Lennox"
        }
    ],
    "messages": [
        {
            "sender_name": "Philip Marlowe",
            "timestamp_ms": 1546857175,
            "content": "Meet later? \u00F0\u009F\u0098\u008A",
            "type": "Generic"
        },
        {
            "sender_name": "Terry Lennox",
            "timestamp_ms": 1546857177,
            "content": "Excellent!! \u00f0\u009f\u0092\u009a",
            "type": "Generic"
        }
    ]
}
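
As far as I can tell, the reason this works is that the export writes each UTF-8 byte of a character as its own \u00XX escape, so JSON.parse produces one character per byte. The sketch below assumes that is what is happening. decodeLatin1 is a hypothetical name, not part of the code above; it relies on Node's "latin1" encoding keeping the low byte of each character, which should give the same result as the loop in decodeFBString.

```javascript
// Hypothetical helper: treat each character (0x00-0xFF) as one byte,
// then decode those bytes as UTF-8.
function decodeLatin1(str) {
    return Buffer.from(str, "latin1").toString("utf8");
}

// The escaped text as it appears in the export
// (backslashes are doubled because this is a JavaScript string literal).
const raw = JSON.parse('"Meet later? \\u00f0\\u009f\\u0098\\u008a"');

// The last four characters are really the four UTF-8 bytes of the emoji.
const bytes = Array.from(raw.slice(-4), (ch) => ch.charCodeAt(0).toString(16));
const decoded = decodeLatin1(raw);

console.log(bytes);   // [ 'f0', '9f', '98', '8a' ]
console.log(decoded); // Meet later? 😊
```

This only holds while every character in the parsed string is below U+0100; a string that is already correctly decoded would be damaged by running it through the helper a second time.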
Terry Lennox
  • Some notes for those using Windows 10. I said I converted the json file using the PowerShell script I referenced. When I did this and output the result to a file, I could view the json file in an editor and the emojis were there. When I ran your script and piped the output to a file then viewed the file in an editor, the emojis were not there! So I debugged your script in VS Code and inspected the message variable. It had the correct emojis. Strange?! I therefore slotted in your decodeFBString function into my Facebook class, sent the output to WordPress and the emojis all appeared. – Praful Jan 07 '19 at 16:02
  • Great stuff. Thank you for that. One thing to note is that this approach also removes the `timestamp_ms` value. I managed to get around this by converting the milliseconds to the desired data format before passing `messages` to `processMessages()`. – Jakub A Suplicki Mar 09 '20 at 05:36
  • Thanks very much for the info @JakubASuplicki, I guess we do the same with all numeric properties. So I'll add a test to ensure we only convert string properties. – Terry Lennox Mar 09 '20 at 08:57
  • That would be amazing. I would love to see an updated code then. Thanks! – Jakub A Suplicki Mar 10 '20 at 03:59
  • So I updated the code: we now check the type of each property; if it's a string we convert it, and if not we leave it as is. – Terry Lennox Mar 10 '20 at 06:08
  • Looks good! Thank you. I recently worked on it as well and I also noticed that in Facebook JSON files, content such as GIFs or pictures - anything with an external link - is in an `array`. That causes an error when passing content to `decodeFBString(str)`, as it expects a `String`. I fixed it by adding a line inside that method which checks whether `str` is an `array` with `Array.isArray(str)`: if it is not, then `arr.push(str.charCodeAt(i))` will run just fine; otherwise it will throw an error. Just thought someone might find that useful when dealing with real-life exported FB JSON files. – Jakub A Suplicki Mar 10 '20 at 23:35
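
Following up on the last comment about arrays: one possible way to cope with nested arrays and objects is to walk the parsed data recursively and decode every string found along the way. This is only a sketch under assumptions. decodeString and decodeDeep are hypothetical names, and the `photos` field in the sample is made up for illustration rather than taken from a real export.

```javascript
// Hypothetical helper, same idea as decodeFBString in the answer:
// treat each character as one byte and decode the bytes as UTF-8.
function decodeString(str) {
    return Buffer.from(str, "latin1").toString("utf8");
}

// Recursively decode strings inside arrays and plain objects.
// Numbers, booleans and null are returned unchanged; object keys are not decoded.
function decodeDeep(value) {
    if (typeof value === "string") return decodeString(value);
    if (Array.isArray(value)) return value.map(decodeDeep);
    if (value !== null && typeof value === "object") {
        return Object.fromEntries(
            Object.entries(value).map(([key, val]) => [key, decodeDeep(val)])
        );
    }
    return value;
}

// Sample message; the "photos" array is an invented field for illustration.
const sample = JSON.parse(
    '{"sender_name":"Terry Lennox","timestamp_ms":1546857177,' +
    '"content":"Excellent!! \\u00f0\\u009f\\u0092\\u009a",' +
    '"photos":[{"uri":"photos/example.jpg"}]}'
);
const result = decodeDeep(sample);
console.log(result);
```

With something like this, the whole parsed file could be passed through decodeDeep once instead of mapping over `messages` only, though that assumes no string in the export is already correctly encoded.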