
I have a JSON file where strings are encoded in raw_unicode_escape (the file itself is UTF-8). How do I parse it so that strings will be UTF-8 in memory?

For individual properties, I could use the following code, but the JSON is very big and manually converting every string after parsing isn't an option.

import json

# Contents of file 'file.json' (the escape sequence '\u00c3\u00a8' is mojibake for 'è')
# { "name": "\u00c3\u00a8" }
with open('file.json', 'r') as input:
    j = json.load(input)
    # re-encode the mojibake string to its original bytes, then decode it as UTF-8
    j['name'] = j['name'].encode('raw_unicode_escape').decode('utf-8')

Since the JSON can be quite huge, the approach has to be "incremental": I cannot read the whole file ahead of time, store it in a string, and then post-process it.

Finally, I should note that the JSON is actually stored in a zip file, so instead of open() it's ZipFile.open().

Samuele Pilleri
  • You want your Python strings to be Python (Unicode) strings, plain and simple. You have no control over how Python manages its internal memory. – tripleee Nov 10 '18 at 19:09
  • Are the lines extremely long and do they contain valid JSON fragments? In other words, could you process a line at a time, perhaps with some provisions for returning the data to the format you want? – tripleee Nov 10 '18 at 21:12
  • @tripleee I don't think so. Actually being able to wrap a (progressive) `raw_unicode_escape` encoder around the input would do it. – Samuele Pilleri Nov 10 '18 at 21:18
  • 1
    Simply using `json.load` on `{ "name": "\u00c3\u00a8" }` should decode those characters perfectly fine. That encoding is part of the JSON spec, and will be decoded by a compliant JSON decoder. "Raw Unicode escapes" are a red herring, they're not your problem. – deceze Nov 11 '18 at 00:18
  • Having said that, `\u00c3\u00a8` is *not* `è`. Looks like the JSON *encoder* produced mojibake there. The encoder needs to be fixed! – deceze Nov 11 '18 at 00:19
  • @deceze Why don't you try it yourself? `json.load` doesn't handle this as expected, or I wouldn't ask here in the first place... Also, `u'\u00c3\u00a8'.encode('raw_unicode_escape').decode('utf8')` _is_ `è` (see the demo after these comments). – Samuele Pilleri Nov 11 '18 at 02:07
  • Well, I did. What *do* you expect and what exactly *does* it do? – deceze Nov 11 '18 at 02:29
  • @deceze Just read the title of the question and the first paragraph: it'll be enough. – Samuele Pilleri Nov 11 '18 at 03:09
  • 1
    Okay, again: your problem is not JSON. Your problem is that the JSON has not encoded the characters correctly and you have JSON-encoded mojibake, which you can fix with that workaround of yours. But the real fix should be wherever that JSON is coming from. Is that possible? Do you control the encoding side? Or can you at least contact that developer and ask them to fix their encoding? – deceze Nov 11 '18 at 03:35
  • 1
    I concur with deceze. If you can fix the thing which produces garbage like this, or preprocess the file separately to fix it up, you don't need to fix the reader. – tripleee Nov 11 '18 at 09:44
  • Unfortunately, I have no control over the encoding side, my workarounds don't really work, and I'm still trying to figure out why. "The thing which produces garbage like this" is Facebook, since this was found in their profile dump file. – Samuele Pilleri Nov 11 '18 at 10:14
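
To make the exchange above concrete, here is a small demo (run directly on the escaped value rather than on the file, purely for illustration): a compliant JSON decoder turns `\u00c3\u00a8` into the two characters `Ã¨` (mojibake), and the `raw_unicode_escape`/UTF-8 round-trip recovers `è`.

import json

# the value exactly as it appears in the dump
raw = r'{ "name": "\u00c3\u00a8" }'

obj = json.loads(raw)
print(obj['name'])   # 'Ã¨' -- valid JSON, but mojibake

# re-encode the mojibake to its original UTF-8 bytes, then decode properly
print(obj['name'].encode('raw_unicode_escape').decode('utf-8'))   # 'è'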

1 Answer


Since codecs.open('file.json', 'r', 'raw_unicode_escape') works somehow, I took a look at its source code and came up with a solution.

>>> import json
>>> from codecs import getreader
>>>
>>> with open('file.json', 'rb') as input:
...     reader = getreader('raw_unicode_escape')(input)
...     j = json.loads(reader.read().encode('raw_unicode_escape'))
...     print(j['name'])
...
è

Of course, that will work even if input is another type of file-like object, like a file inside a zip archive in my case.
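
For the zip case specifically, a minimal sketch (the archive and member names below are placeholders, not taken from the question) would look like this; ZipFile.open() returns a binary file-like object, so the same StreamReader wrapping applies:

import json
from codecs import getreader
from zipfile import ZipFile

with ZipFile('archive.zip') as archive:           # placeholder archive name
    with archive.open('file.json') as input:      # ZipFile.open() yields a binary stream
        reader = getreader('raw_unicode_escape')(input)
        j = json.loads(reader.read().encode('raw_unicode_escape'))
        print(j['name'])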

In the end, I dropped the idea of an incremental encoder (it doesn't make much sense for JSON), but for those interested I suggest taking a look at this answer as well as codecs.iterencode(); a toy illustration of the latter follows.
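
As a toy illustration only (the chunks below are hypothetical, already-decoded mojibake text, not data read from the actual dump), codecs.iterencode() drives an incremental encoder over an iterable of string chunks and yields the re-encoded bytes piece by piece:

import codecs

# hypothetical, already-decoded (mojibake) text chunks
chunks = ['{ "name": ', '"Ã¨" }']

# iterencode() uses an incremental encoder and yields bytes chunk by chunk
encoded = b''.join(codecs.iterencode(chunks, 'raw_unicode_escape'))
print(encoded)   # b'{ "name": "\xc3\xa8" }'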

Samuele Pilleri