6

KISSmetrics generates invalid JSON strings that I need to parse. I'm getting lots of errors like these:

ERROR 2013-03-04 04:31:12,253 Invalid \escape: line 1 column 132 (char 132): {"search engine":"Google","_n":"search engine hit","_p":"z392cpdpnm6silblq5mac8kiugq=","search terms":"happy new year animation 1920\303\2271080 hd","_t":1356390128}

ERROR 2013-03-04 04:34:19,153 Invalid \escape: line 1 column 101 (char 101): {"search engine":"Google","_n":"ad campaign hit","_p":"byskpczsw6sorbmzqi0tk1uimgw=","search terms":"\331\203\330\261\330\252\331\207 \331\201\331\212\330\257\331\212\330\244\331\211 \330\256\331\212\331\204\330\247\330\255\331\211 \331\203\331\210\330\261\330\257\331\211","_t":1356483052}

My code is:

for line in lines:
    try:
        data = self.clean_data(json.loads(line))
    except ValueError as e:
        # Log the parser message together with the offending line
        logger.error('%s: %s' % (e.message, line))

Example raw data:

{"search engine":"Google","_n":"search engine hit","_p":"kvceh84hzbhywcnlivv+hdztizw=","search terms":"military sound effects programs","_t":1356034177}

Is there any way to clean up this messy JSON and parse it? Thanks for your help.

user202729
Michael Samoylov
  • How do you parse the JSON? What is the `repr()` of the value before decoding? – Martijn Pieters Mar 04 '13 at 09:58
  • Ah, your input data has *octal* escapes, it looks like. Those would be invalid JSON indeed. – Martijn Pieters Mar 04 '13 at 10:01
  • https://stackoverflow.com/questions/10480148/json-string-decoding-encountering-invalid-escape – cardamom Nov 08 '18 at 15:33
  • There's also a suggestion using `unicode-escape` in a duplicate question: https://stackoverflow.com/questions/43018576/valueerror-invalid-escape-when-readin-json-as-respons-in-scrapy – user202729 Feb 17 '21 at 11:31
  • Related: Missing double escape in windows file path: [python - json reading error json.decoder.JSONDecodeError: Invalid \escape - Stack Overflow](https://stackoverflow.com/questions/44687525/json-reading-error-json-decoder-jsondecodeerror-invalid-escape), hexadecimal escape [python - simplejson.loads() get Invalid \escape: 'x' - Stack Overflow](https://stackoverflow.com/questions/4296041/simplejson-loads-get-invalid-escape-x?noredirect=1&lq=1) – user202729 Feb 17 '21 at 11:56

3 Answers

12

Your input data contains octal escapes (for example `\303\227`), which JSON does not allow. Replace them with the bytes they encode, using a regular expression:

import re

invalid_escape = re.compile(r'\\[0-7]{1,3}')  # up to 3 digits for byte values up to FF

def replace_with_byte(match):
    return chr(int(match.group(0)[1:], 8))

def repair(brokenjson):
    return invalid_escape.sub(replace_with_byte, brokenjson)

This makes your input work:

>>> data1 = r"""{"search engine":"Google","_n":"search engine hit","_p":"z392cpdpnm6silblq5mac8kiugq=","search terms":"happy new year animation 1920\303\2271080 hd","_t":1356390128}"""
>>> json.loads(repair(data1))
{u'_n': u'search engine hit', u'search terms': u'happy new year animation 1920\xd71080 hd', u'_p': u'z392cpdpnm6silblq5mac8kiugq=', u'_t': 1356390128, u'search engine': u'Google'}
>>> print json.loads(repair(data1))['search terms']
happy new year animation 1920×1080 hd
>>> data2 = r"""{"search engine":"Google","_n":"ad campaign hit","_p":"byskpczsw6sorbmzqi0tk1uimgw=","search terms":"\331\203\330\261\330\252\331\207 \331\201\331\212\330\257\331\212\330\244\331\211 \330\256\331\212\331\204\330\247\330\255\331\211 \331\203\331\210\330\261\330\257\331\211","_t":1356483052}"""
>>> json.loads(repair(data2))
{u'_n': u'ad campaign hit', u'search terms': u'\u0643\u0631\u062a\u0647 \u0641\u064a\u062f\u064a\u0624\u0649 \u062e\u064a\u0644\u0627\u062d\u0649 \u0643\u0648\u0631\u062f\u0649', u'_p': u'byskpczsw6sorbmzqi0tk1uimgw=', u'_t': 1356483052, u'search engine': u'Google'}
>>> print json.loads(repair(data2))['search terms']
كرته فيديؤى خيلاحى كوردى
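
The code above is Python 2, where `chr()` returns a single byte and `json.loads()` decodes the resulting UTF-8 byte string. In Python 3, `chr()` returns a Unicode code point, so the same function would produce mojibake instead of `×`. A minimal Python 3 sketch follows; the names `OCTAL_RUN`, `_decode_run` and `repair_py3` are hypothetical, and it assumes the escapes always have three digits and encode UTF-8, as in the sample data:

```python
import json
import re

# One or more consecutive three-digit octal escapes (\000 to \377),
# e.g. \303\227. Assumption: escapes are always three digits long.
OCTAL_RUN = re.compile(r'(?:\\[0-3][0-7]{2})+')

def _decode_run(match):
    # Convert each \ooo into one byte, then decode the whole run as UTF-8.
    codes = match.group(0).split('\\')[1:]
    text = bytes(int(code, 8) for code in codes).decode('utf-8')
    # Re-escape as JSON (\uXXXX) so characters such as " or \ stay valid
    # inside the surrounding JSON string; strip the quotes dumps() adds.
    return json.dumps(text)[1:-1]

def repair_py3(broken_json):
    # Replace every run of octal escapes with valid JSON escapes.
    return OCTAL_RUN.sub(_decode_run, broken_json)
```

Calling `json.loads(repair_py3(line))` in place of `json.loads(line)` should then parse lines like the two in the question. Like the original, this sketch cannot tell a real octal escape from a legitimately escaped backslash followed by digits, and it raises `UnicodeDecodeError` if a run is not valid UTF-8.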
Martijn Pieters
2

Consider cjson for this exact scenario (https://pypi.python.org/pypi/python-cjson).

It seems to handle the escaped octals, and it is fast.

kermatt
0

I had a similar problem, and simply replacing the json library with yaml solved it. (YAML is, for the most part, a superset of JSON.)

Example:

import yaml

obj = yaml.safe_load(json_string)  # instead of json.loads(json_string)
Luke