Parsing JSON using Pandas - issue with additional \ escape characters

Question

I'm downloading files from S3 that contains JSON (like) data which I intend to parse into a Pandas dataframe using pd.read_json.

My problem is that the files dumped into the S3 bucket use an 'octal escape' formatting for non english characters but Python/Pandas objects to the fact that an escape for the \ character is also included.

An example would be the string: "destination":"Provence-Alpes-C\\303\\264te d\'Azur"

Which prints as:

If I manually remove one of the \ characters then Python happily interprets the string and it prints as:

There is some good stuff in this thread and although .decode('string_escape') works well on an individual snippet, when its part of the much longer string comprising thousands of records then it doesn't work.

I believe that I need a clever way to replace the \\ with \ but for well documented reasons, .replace('\\', '\') doesn't work.

In order to get the files to work at all I used a regex to remove all \ followed by a number: re.sub(r'\\(?=[0-9])', '', g) - I'm thinking that an adaptation of this might be the way forward but the number needs to be dynamic as I don't know what it will be (i.e. using \3 and \2 for the example above isn't going to work')

Help appreciated.

Thanks Martin, this works well. Very rarely I'm getting an error - chr() arg not in range(256). I've tried to pinpoint exactly where this is falling over although its tough as these files are massive!! — Jacob, Aug 25 '15 at 15:48
I've been thinking of re-opening actually, because you have octal escapes for UTF-8 bytes, not for Unicode codepoints. Related, not the same. — Martijn Pieters, Aug 25 '15 at 16:02

score 2 · Accepted Answer · edited May 23 '17 at 12:22

Rather than have Python interpret \ooo octal escapes, repair the JSON with a regular expression, then parse it as JSON. I did so before in similar circumstances

Your data has UTF-8 bytes escaped to octal \ooo sequences, so you are looking for a more limited range of values here:

import re

invalid_escape = re.compile(r'\\([1-3][0-7]{2}|[1-7][0-7]?)')  # octal digits from 1 up to FF
def replace_with_codepoint(match):
    return chr(int(match.group(0)[1:], 8))

def repair(brokenjson):
    return invalid_escape.sub(replace_with_codepoint, brokenjson)

Demo:

>>> import json
>>> sample = '{"destination":"Provence-Alpes-C\\303\\264te d\'Azur"}'
>>> repair(sample)
'{"destination":"Provence-Alpes-C\xc3\xb4te d\'Azur"}'
>>> json.loads(repair(sample))
{u'destination': u"Provence-Alpes-C\xf4te d'Azur"}
>>> print json.loads(repair(sample))['destination']
Provence-Alpes-Côte d'Azur

Parsing JSON using Pandas - issue with additional \ escape characters

1 Answers1