1

I'm downloading files from S3 that contains JSON (like) data which I intend to parse into a Pandas dataframe using pd.read_json.

My problem is that the files dumped into the S3 bucket use an 'octal escape' formatting for non english characters but Python/Pandas objects to the fact that an escape for the \ character is also included.

An example would be the string: "destination":"Provence-Alpes-C\\303\\264te d\'Azur"

Which prints as:

enter image description here

If I manually remove one of the \ characters then Python happily interprets the string and it prints as:

enter image description here

There is some good stuff in this thread and although .decode('string_escape') works well on an individual snippet, when its part of the much longer string comprising thousands of records then it doesn't work.

I believe that I need a clever way to replace the \\ with \ but for well documented reasons, .replace('\\', '\') doesn't work.

In order to get the files to work at all I used a regex to remove all \ followed by a number: re.sub(r'\\(?=[0-9])', '', g) - I'm thinking that an adaptation of this might be the way forward but the number needs to be dynamic as I don't know what it will be (i.e. using \3 and \2 for the example above isn't going to work')

Help appreciated.

Community
  • 1
  • 1
Jacob
  • 3,437
  • 3
  • 18
  • 31
  • Thanks Martin, this works well. Very rarely I'm getting an error - chr() arg not in range(256). I've tried to pinpoint exactly where this is falling over although its tough as these files are massive!! – Jacob Aug 25 '15 at 15:48
  • See the gist I linked in my second comment. – Martijn Pieters Aug 25 '15 at 15:57
  • I've been thinking of re-opening actually, because you have octal escapes for UTF-8 bytes, not for Unicode codepoints. Related, not the same. – Martijn Pieters Aug 25 '15 at 16:02

1 Answers1

2

Rather than have Python interpret \ooo octal escapes, repair the JSON with a regular expression, then parse it as JSON. I did so before in similar circumstances

Your data has UTF-8 bytes escaped to octal \ooo sequences, so you are looking for a more limited range of values here:

import re

invalid_escape = re.compile(r'\\([1-3][0-7]{2}|[1-7][0-7]?)')  # octal digits from 1 up to FF
def replace_with_codepoint(match):
    return chr(int(match.group(0)[1:], 8))

def repair(brokenjson):
    return invalid_escape.sub(replace_with_codepoint, brokenjson)

Demo:

>>> import json
>>> sample = '{"destination":"Provence-Alpes-C\\303\\264te d\'Azur"}'
>>> repair(sample)
'{"destination":"Provence-Alpes-C\xc3\xb4te d\'Azur"}'
>>> json.loads(repair(sample))
{u'destination': u"Provence-Alpes-C\xf4te d'Azur"}
>>> print json.loads(repair(sample))['destination']
Provence-Alpes-Côte d'Azur
Community
  • 1
  • 1
Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343