53

I downloaded my Facebook Messenger data (in your Facebook account, go to Settings, then Your Facebook Information, then Download Your Information, and create a file with at least the Messages box checked) to do some cool statistics.

However, there is a problem with the encoding. I'm not sure, but it looks like Facebook used a broken encoding for this data. When I open a file in a text editor I see something like this: Rados\u00c5\u0082aw. When I open it with Python (as UTF-8) I get RadosÅ\x82aw, but I should get Radosław.

My python script:

import json
import os

with open(os.path.join(subdir, file), encoding='utf-8') as text:
    conversations.append(json.load(text))

I tried a few of the most common encodings. Example data:

{
  "sender_name": "Rados\u00c5\u0082aw",
  "timestamp": 1524558089,
  "content": "No to trzeba ostatnie treningi zrobi\u00c4\u0087 xD",
  "type": "Generic"
}
Jakub Jendryka
  • Why do you assume that the data is UTF-8? If you don't know its encoding, have you tried other reasonable possibilities, e.g. Windows-1250 or ISO 8859-2? – Peteris Apr 24 '18 at 18:21
  • I tried a few of them; none worked. I encountered this question asked earlier: https://stackoverflow.com/questions/19161501/reading-json-what-encoding-is-u00c5-u0082-how-do-i-get-it-to-a-unicode-obje but I have no idea how to make it work for me – Jakub Jendryka Apr 24 '18 at 18:28
  • No idea if it helps, but emoji encoding seems to be funky in Facebook's API: https://stackoverflow.com/questions/20045268/how-does-facebook-encode-emoji-in-the-json-graph-api – Patrick Artner Apr 24 '18 at 18:44
  • @JakubJendryka: right, I'm not familiar with that system and perhaps there is indeed a mojibake in there; UTF-8 data being decoded as Latin-1 and then encoded as JSON. – Martijn Pieters Apr 24 '18 at 18:50
  • @Patrick: that’s pretty much ancient history by now. We no longer use that encoding (and that only applies to Emoji). – Martijn Pieters Apr 24 '18 at 23:39
  • This one did it for me: https://stackoverflow.com/a/5396742/2297366 – Dylan Vander Berg Jan 18 '20 at 04:29
  • [For those using .NET C# solution](https://stackoverflow.com/a/50803989/396337) – Zyo Feb 23 '21 at 13:24

9 Answers

72

I can indeed confirm that the Facebook download data is incorrectly encoded; a Mojibake. The original data is UTF-8 encoded but was decoded as Latin-1 instead. I’ll make sure to file a bug report.

What this means is that any non-ASCII character in the string data was encoded twice. First to UTF-8, and then the UTF-8 bytes were encoded again by interpreting them as Latin-1 encoded data (which maps exactly 256 characters to the 256 possible byte values), by using the \uHHHH JSON escape notation (so a literal backslash, a literal lowercase letter u, followed by 4 hex digits, 0-9 and a-f). Because the second step encoded byte values in the range 0-255, this resulted in a series of \u00HH sequences (a literal backslash, a literal lower case letter u, two 0 zero digits and two hex digits).

E.g. the Unicode character U+0142 LATIN SMALL LETTER L WITH STROKE in the name Radosław was encoded to the UTF-8 byte values C5 and 82 (in hex notation), and then encoded again to \u00c5\u0082.
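You can reproduce that double encoding in an interactive Python session (a quick illustration of the mechanism, not part of the repair):

```python
import json

# U+0142 LATIN SMALL LETTER L WITH STROKE encodes to the two UTF-8 bytes C5 82 ...
utf8_bytes = 'ł'.encode('utf8')
print(utf8_bytes)            # b'\xc5\x82'

# ... which, misread as Latin-1, become the two characters 'Å' and '\x82' ...
mojibake = utf8_bytes.decode('latin1')

# ... and JSON-escaping those yields exactly the sequence seen in the export.
print(json.dumps(mojibake))  # "\u00c5\u0082"
```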

You can repair the damage in two ways:

  1. Decode the data as JSON, then re-encode any string values as Latin-1 binary data, and then decode again as UTF-8:

     >>> import json
     >>> data = r'"Rados\u00c5\u0082aw"'
     >>> json.loads(data).encode('latin1').decode('utf8')
     'Radosław'
    

    This would require a full traversal of your data structure to find all those strings, of course.

  2. Load the whole JSON document as binary data, replace all \u00hh JSON sequences with the byte the last two hex digits represent, then decode as JSON:

     import re
     from functools import partial
    
     fix_mojibake_escapes = partial(
         re.compile(rb'\\u00([\da-f]{2})').sub,
         lambda m: bytes.fromhex(m[1].decode()),
     )
    
     with open(os.path.join(subdir, file), 'rb') as binary_data:
         repaired = fix_mojibake_escapes(binary_data.read())
     data = json.loads(repaired)
    

    (If you are using Python 3.5 or older, you'll have to decode the repaired bytes object from UTF-8, so use json.loads(repaired.decode())).

    From your sample data this produces:

     {'content': 'No to trzeba ostatnie treningi zrobić xD',
      'sender_name': 'Radosław',
      'timestamp': 1524558089,
      'type': 'Generic'}
    

    The regular expression matches against all \u00HH sequences in the binary data and replaces those with the bytes they represent, so that the data can be decoded correctly as UTF-8. The second decoding is taken care of by the json.loads() function when given binary data.
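For option 1, the full traversal could be sketched like this (a minimal version; the helper name `repair` is my own):

```python
import json

def repair(value):
    # Re-encode every string as Latin-1 and decode it again as UTF-8,
    # recursing into lists and dicts; leave all other values untouched.
    if isinstance(value, str):
        return value.encode('latin1').decode('utf8')
    if isinstance(value, list):
        return [repair(v) for v in value]
    if isinstance(value, dict):
        return {k: repair(v) for k, v in value.items()}
    return value

raw = r'{"sender_name": "Rados\u00c5\u0082aw", "timestamp": 1524558089}'
print(repair(json.loads(raw)))  # {'sender_name': 'Radosław', 'timestamp': 1524558089}
```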

Martijn Pieters
  • I get 'Nedirbsiu\U0001f972' on Python version 3.8.8, but Nedirbsiu on 3.10.2, so I guess you are right. And thanks for the explanation! – brikas Aug 29 '22 at 14:56
  • @brikas: `\U0001f972` is the escape sequence for the [U+1F972 SMILING FACE WITH TEAR](https://www.fileformat.info/info/unicode/char/1f972/index.htm) codepoint; Python 3.8 was released with Unicode 12.0.0 support and it uses `\xHH` / `\uHHHH` / `\UHHHHHHHH` escapes (plus `\n`, `\t` and `\r`) for any codepoint not marked as _printable_ in that standard. Since U+1F972 was defined in Unicode 13.0.0 Python 3.8 doesn't know it is a printable codepoint. – Martijn Pieters Aug 29 '22 at 15:49
13

Here is a command-line solution with jq and iconv. Tested on Linux.

cat message_1.json | jq . | iconv -f utf8 -t latin1 > m1.json

luksan
8

I would like to extend @Geekmoss' answer with the following recursive code snippet, which I used to decode my Facebook data.

import json

def parse_obj(obj):
    if isinstance(obj, str):
        return obj.encode('latin_1').decode('utf-8')

    if isinstance(obj, list):
        return [parse_obj(o) for o in obj]

    if isinstance(obj, dict):
        return {key: parse_obj(item) for key, item in obj.items()}

    return obj

# `file` here holds the raw JSON text read from disk
decoded_data = parse_obj(json.loads(file))

I noticed this works better, because the Facebook data you download might contain lists of dicts; the recursion descends into those dicts, instead of returning them as-is the way the identity branch otherwise would.

hotigeftas
6

My solution for parsing objects uses the object_hook callback of the load/loads functions:

import json


def parse_obj(dct):
    for key in dct:
        dct[key] = dct[key].encode('latin_1').decode('utf-8')
    return dct


data = r'{"msg": "Ahoj sv\u00c4\u009bte"}'

# String
json.loads(data)  
# Out: {'msg': 'Ahoj svÄ\x9bte'}
json.loads(data, object_hook=parse_obj)  
# Out: {'msg': 'Ahoj světe'}

# File
with open('/path/to/file.json') as f:
     json.load(f, object_hook=parse_obj)
     # Out: {'msg': 'Ahoj světe'}

Update:

The solution above does not work for lists of strings, so here is an updated version:

import json


def parse_obj(obj):
    for key in obj:
        if isinstance(obj[key], str):
            obj[key] = obj[key].encode('latin_1').decode('utf-8')
        elif isinstance(obj[key], list):
            obj[key] = [x if not isinstance(x, str) else x.encode('latin_1').decode('utf-8') for x in obj[key]]
    return obj
Geekmoss
1

Based on @Martijn Pieters' solution, I wrote something similar in Java.

public String getMessengerJson(Path path) throws IOException {
    String badlyEncoded = Files.readString(path, StandardCharsets.UTF_8);
    String unescaped = unescapeMessenger(badlyEncoded);
    byte[] bytes = unescaped.getBytes(StandardCharsets.ISO_8859_1);
    String fixed = new String(bytes, StandardCharsets.UTF_8);
    return fixed;
}

The unescape method is inspired by org.apache.commons.lang.StringEscapeUtils.

private String unescapeMessenger(String str) {
    if (str == null) {
        return null;
    }
    try {
        StringWriter writer = new StringWriter(str.length());
        unescapeMessenger(writer, str);
        return writer.toString();
    } catch (IOException ioe) {
        // this should never ever happen while writing to a StringWriter
        throw new UnhandledException(ioe);
    }
}

private void unescapeMessenger(Writer out, String str) throws IOException {
    if (out == null) {
        throw new IllegalArgumentException("The Writer must not be null");
    }
    if (str == null) {
        return;
    }
    int sz = str.length();
    StrBuilder unicode = new StrBuilder(4);
    boolean hadSlash = false;
    boolean inUnicode = false;
    for (int i = 0; i < sz; i++) {
        char ch = str.charAt(i);
        if (inUnicode) {
            unicode.append(ch);
            if (unicode.length() == 4) {
                // unicode now contains the four hex digits
                // which represents our unicode character
                try {
                    int value = Integer.parseInt(unicode.toString(), 16);
                    out.write((char) value);
                    unicode.setLength(0);
                    inUnicode = false;
                    hadSlash = false;
                } catch (NumberFormatException nfe) {
                    throw new NestableRuntimeException("Unable to parse unicode value: " + unicode, nfe);
                }
            }
            continue;
        }
        if (hadSlash) {
            hadSlash = false;
            if (ch == 'u') {
                inUnicode = true;
            } else {
                out.write("\\");
                out.write(ch);
            }
            continue;
        } else if (ch == '\\') {
            hadSlash = true;
            continue;
        }
        out.write(ch);
    }
    if (hadSlash) {
        // then we're in the weird case of a \ at the end of the
        // string, let's output it anyway.
        out.write('\\');
    }
}
Ondrej Sotolar
  • So I've spent some time trying out your Java solution, only needing to debug and learning that in the larger unescapeMessenger routine, at the top of the for loop, you have an *if (inUnicode)*, which you set to false right before the loop starts ... so nothing gets processed ... what's up with that? – Michael Sims Apr 30 '20 at 18:14
  • But the for loop block doesn't end with the first conditional block. The inUnicode variable is set to true in the second conditional block if we are on the 'u' character of the '\u' prefix. – Ondrej Sotolar May 06 '20 at 10:16
  • Well, it never worked for me, I parsed out the string a different way that was crude, but effective. – Michael Sims May 13 '20 at 16:15
1

Facebook programmers seem to have mixed up the concepts of Unicode encoding and escape sequences, probably while implementing their own ad-hoc serializer. Further details in Invalid Unicode encodings in Facebook data exports.

Try this:

import json
import io

class FacebookIO(io.FileIO):
    def read(self, size: int = -1) -> bytes:
        data: bytes = super(FacebookIO, self).readall()
        new_data: bytes = b''
        i: int = 0
        while i < len(data):
            # \u00c4\u0085
            # 0123456789ab
            if data[i:].startswith(b'\\u00'):
                u: int = 0
                new_char: bytes = b''
                while data[i+u:].startswith(b'\\u00'):
                    hex = int(bytes([data[i+u+4], data[i+u+5]]), 16)
                    new_char = b''.join([new_char, bytes([hex])])
                    u += 6

                char : str = new_char.decode('utf-8')
                new_chars: bytes = bytes(json.dumps(char).strip('"'), 'ascii')
                new_data += new_chars
                i += u
            else:
                new_data = b''.join([new_data, bytes([data[i]])])
                i += 1

        return new_data

if __name__ == '__main__':
    f = FacebookIO('data.json','rb')
    d = json.load(f)
    print(d)
kravietz
0

This is @Geekmoss' answer, but adapted for Python 3:

def parse_facebook_json(json_file_path):
    def parse_obj(obj):
        for key in obj:
            if isinstance(obj[key], str):
                obj[key] = obj[key].encode('latin_1').decode('utf-8')
            elif isinstance(obj[key], list):
                obj[key] = list(map(lambda x: x if type(x) != str else x.encode('latin_1').decode('utf-8'), obj[key]))
        return obj
    with json_file_path.open('rb') as json_file:
        return json.load(json_file, object_hook=parse_obj)

# Usage
parse_facebook_json(Path("/.../message_1.json"))
nicbou
0

Extending Martijn's solution #1, which can lead you towards recursive object processing (it certainly led me there initially):

You can instead apply the fix to the whole JSON object at once, if you serialize it with ensure_ascii=False:

json.dumps(obj, ensure_ascii=False, indent=2).encode('latin-1').decode('utf-8')

Then write the result to a file.
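For example, on the sample value from the question (the variable names here are mine):

```python
import json

# `obj` stands for the already-loaded, still-mojibake data structure.
obj = json.loads(r'{"sender_name": "Rados\u00c5\u0082aw"}')

# Serialize without ASCII escaping, then undo the Latin-1 mis-decoding in one pass.
fixed_text = json.dumps(obj, ensure_ascii=False).encode('latin-1').decode('utf-8')
print(fixed_text)  # {"sender_name": "Radosław"}
```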

PS: This should be comment on @Martijn answer: https://stackoverflow.com/a/50011987/1309932 (but I can't add comments)

danbaragan
0

This is my approach for Node 17.0.1, based on @hotigeftas' recursive code, using the iconv-lite package.

import fs from 'fs';
import iconv from 'iconv-lite';

function parseObject(object) {
  if (typeof object == 'string') {
    return iconv.decode(iconv.encode(object, 'latin1'), 'utf8');
  }

  if (typeof object == 'object') {
    for (let key in object) {
      object[key] = parseObject(object[key]);
    }
    return object;
  }

  return object;
}

//usage
let file = JSON.parse(fs.readFileSync(fileName));
file = parseObject(file);