47

Using Python's (2.7) json module, I'm looking to process various JSON feeds. Unfortunately some of these feeds do not conform to the JSON standard - specifically, some keys are not wrapped in double quotes ("). This is causing Python to bug out.

Before writing an ugly-as-hell piece of code to parse and repair the incoming data, I thought I'd ask - is there any way to allow Python to either parse this malformed JSON or 'repair' the data so that it would be valid JSON?

Working example

>>> import json
>>> json.loads('{"key1":1,"key2":2,"key3":3}')
{u'key3': 3, u'key2': 2, u'key1': 1}

Broken example

>>> import json
>>> json.loads('{key1:1,key2:2,key3:3}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\json\__init__.py", line 310, in loads
    return _default_decoder.decode(s)
  File "C:\Python27\lib\json\decoder.py", line 346, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python27\lib\json\decoder.py", line 362, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 1 column 1 (char 1)

I've written a small regex to fix the JSON coming from this particular provider, but I foresee this being an issue in the future. Below is what I came up with.

>>> import re
>>> s = '{key1:1,key2:2,key3:3}'
>>> s = re.sub('([{,])([^{:\s"]*):', lambda m: '%s"%s":'%(m.group(1),m.group(2)),s)
>>> s
'{"key1":1,"key2":2,"key3":3}'
Seidr

6 Answers

33

You're trying to use a JSON parser to parse something that isn't JSON. Your best bet is to get the creator of the feeds to fix them.

I understand that isn't always possible. You might be able to fix the data using regexes, depending on how broken it is:

j = re.sub(r"{\s*(\w)", r'{"\1', j)
j = re.sub(r",\s*(\w)", r',"\1', j)
j = re.sub(r"(\w):", r'\1":', j)
Ned Batchelder
  • Thanks for your input - I highly doubt the provider will respond, but I'll try and contact them. I also gave regex a try. I've edited my question to reflect my findings with regex. – Seidr Oct 27 '10 at 13:32
  • I'm going to leave this open for a while to see if anyone else has any further suggestions - otherwise I'll accept your answer. Looking at the regex statements you added, they do pretty much the same thing as mine. – Seidr Oct 27 '10 at 14:20
  • Beware that while this regex might work in some very specific scenarios, it will **not** work for more complex stuff like `{ location: 'http://www.google.com' }`; you'll end up with invalid JSON: `{"location": "http"://www.google.com"}` – Marcos Dimitrio Jun 30 '15 at 01:10
17

Another option is to use the demjson module, which can parse JSON in non-strict mode.
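
For example (a minimal sketch, assuming the third-party demjson package is installed; as far as I know its decode function is non-strict by default):

import demjson
obj = demjson.decode('{key1:1,key2:2,key3:3}')  # parses despite the unquoted keys
# obj == {u'key1': 1, u'key2': 2, u'key3': 3}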

Joel
11

The regular expressions pointed out by Ned and cheeseinvert don't take into account cases where the match occurs inside a string.

See the following example (using cheeseinvert's solution):

>>> fixLazyJsonWithRegex ('{ key : "a { a : b }", }')
'{ "key" : "a { "a": b }" }'

The problem is that the expected output is:

'{ "key" : "a { a : b }" }'

Since JSON tokens are a subset of Python tokens, we can use Python's tokenize module.
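
To see what this buys us, here is roughly the token stream the tokenizer produces for one of the lazy objects (an illustrative Python 2.7 session):

>>> import tokenize, token
>>> from StringIO import StringIO
>>> for tokid, tokval, _, _, _ in tokenize.generate_tokens(StringIO('{key1:1}').readline):
...     print token.tok_name[tokid], repr(tokval)
...
OP '{'
NAME 'key1'
OP ':'
NUMBER '1'
OP '}'
ENDMARKER ''

Each token can then be inspected and rewritten individually, which is what the function below does.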

Please correct me if I'm wrong, but the following code will fix a lazy JSON string in all of these cases:

import tokenize
import token
from StringIO import StringIO

def fixLazyJson (in_text):
  tokengen = tokenize.generate_tokens(StringIO(in_text).readline)

  result = []
  for tokid, tokval, _, _, _ in tokengen:
    # fix unquoted strings
    if (tokid == token.NAME):
      if tokval not in ['true', 'false', 'null', '-Infinity', 'Infinity', 'NaN']:
        tokid = token.STRING
        tokval = u'"%s"' % tokval

    # fix single-quoted strings
    elif (tokid == token.STRING):
      if tokval.startswith ("'"):
        tokval = u'"%s"' % tokval[1:-1].replace ('"', '\\"')

    # remove invalid commas
    elif (tokid == token.OP) and ((tokval == '}') or (tokval == ']')):
      if (len(result) > 0) and (result[-1][1] == ','):
        result.pop()

    result.append((tokid, tokval))

  return tokenize.untokenize(result)

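For instance, running it against the example from above (a quick check):

>>> import json
>>> fixed = fixLazyJson ('{ key : "a { a : b }", }')
>>> json.loads (fixed)
{u'key': u'a { a : b }'}
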
So, in order to parse a JSON string, you might want to fall back to fixLazyJson only when json.loads fails (to avoid the performance penalty for well-formed JSON):

import json

def json_decode (json_string, *args, **kwargs):
  try:
    return json.loads (json_string, *args, **kwargs)
  except ValueError:
    json_string = fixLazyJson (json_string)
    return json.loads (json_string, *args, **kwargs)
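
For example (illustrative):

>>> obj = json_decode('{key1:1,key2:2,key3:3}')
>>> obj['key1'], obj['key2'], obj['key3']
(1, 2, 3)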

The only problem I see when fixing lazy JSON this way is that, if the JSON is still malformed, the error raised by the second json.loads will reference the line and column in the modified string, not the original one.

As a final note, I just want to point out that it would be straightforward to update any of these methods to accept a file object instead of a string.

BONUS: Apart from this, people often like to include C/C++-style comments when JSON is used for configuration files. In that case, you can either remove the comments using a regular expression, or use the extended version below, which fixes the JSON string in one pass:

import tokenize
import token
from StringIO import StringIO

def fixLazyJsonWithComments (in_text):
  """ Same as fixLazyJson but removing comments as well
  """
  result = []
  tokengen = tokenize.generate_tokens(StringIO(in_text).readline)

  sline_comment = False
  mline_comment = False
  last_token = ''

  for tokid, tokval, _, _, _ in tokengen:

    # skip the rest of a single-line comment
    if sline_comment:
      if (tokid == token.NEWLINE) or (tokid == tokenize.NL):
        sline_comment = False
      continue

    # skip a multi-line comment until the closing */
    if mline_comment:
      if (last_token == '*') and (tokval == '/'):
        mline_comment = False
      last_token = tokval
      continue

    # fix unquoted strings
    if (tokid == token.NAME):
      if tokval not in ['true', 'false', 'null', '-Infinity', 'Infinity', 'NaN']:
        tokid = token.STRING
        tokval = u'"%s"' % tokval

    # fix single-quoted strings
    elif (tokid == token.STRING):
      if tokval.startswith ("'"):
        tokval = u'"%s"' % tokval[1:-1].replace ('"', '\\"')

    # remove invalid commas
    elif (tokid == token.OP) and ((tokval == '}') or (tokval == ']')):
      if (len(result) > 0) and (result[-1][1] == ','):
        result.pop()

    # detect single-line comments
    elif tokval == "//":
      sline_comment = True
      continue

    # detect multiline comments
    elif (last_token == '/') and (tokval == '*'):
      result.pop() # remove previous token
      mline_comment = True
      continue

    result.append((tokid, tokval))
    last_token = tokval

  return tokenize.untokenize(result)
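
For example, feeding it a snippet with both comment styles (a quick sketch):

>>> import json
>>> lazy = '{ key: "value" /* block comment */, other: 123 } // trailing comment'
>>> obj = json.loads (fixLazyJsonWithComments (lazy))
>>> obj['key'], obj['other']
(u'value', 123)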
psanchez
  • Indeed, thanks, although to get it to work I had to also add `import StringIO` and change the line using StringIO to `StringIO.StringIO(in_text)` from `StringIO(in_text)`. Then it worked a treat on a lazy JSON that Google Finance uses for delayed option chain quotes. – Bitfool Dec 23 '15 at 21:31
  • Thanks! I forgot to add the "from StringIO import StringIO" to the code that I pasted here. Now it is updated :) – psanchez Jan 14 '16 at 09:55
  • Dude, this is an absolute lifesaver. Thank you for posting this. – zorrotmm Nov 08 '16 at 11:12
6

Expanding on Ned's suggestion, the following has been helpful for me:

j = re.sub(r"{\s*'?(\w)", r'{"\1', j)
j = re.sub(r",\s*'?(\w)", r',"\1', j)
j = re.sub(r"(\w)'?\s*:", r'\1":', j)
j = re.sub(r":\s*'(\w+)'\s*([,}])", r':"\1"\2', j)
cheeseinvert
  • On that last line, the first (\w) needs to be (\w*) since you're trying to match the whole word. – Chris Matta Mar 01 '13 at 18:30
  • Thanks Chris, I updated to \w+ since a 0-char match wouldn't make sense – cheeseinvert Aug 29 '13 at 18:46
  • And, for those of us who accidentally create 'Pythonic' JSON with trailing comma: j = re.sub(r",\s*\]", "]", j) ... I didn't edit the answer since there may well be drawbacks that I haven't thought about. – Scott Lawton Jun 13 '15 at 03:35
1

In a similar case, I have used ast.literal_eval. AFAIK, this won't work when the constant null (corresponding to Python's None) appears in the JSON (the same goes for true and false, which aren't Python literals either).

Given that you know about the null/None predicament, you can:

import ast
decoded_object = ast.literal_eval(json_encoded_text)
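
For example (a sketch; the null workaround via a plain text replace is my illustration, not part of the answer above, and note it would also mangle any string value that happens to contain the word null):

>>> import ast
>>> ast.literal_eval("{'key1': 1, 'key2': 2}")['key1']  # Python-style quoting is fine
1
>>> ast.literal_eval('{"a": null}'.replace('null', 'None'))
{'a': None}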
tzot
0

In addition to Ned's and cheeseinvert's suggestions, adding (?!/) should avoid the problem with URLs mentioned above:

j = re.sub(r"{\s*'?(\w)", r'{"\1', j)
j = re.sub(r",\s*'?(\w)", r',"\1', j)
j = re.sub(r"(\w)'?\s*:(?!/)", r'\1":', j)
j = re.sub(r":\s*'(\w+)'\s*([,}])", r':"\1"\2', j) 
j = re.sub(r",\s*]", "]", j)
Stan