36

When screen-scraping a website, I extract data from <script> tags.
The data I get is not standard JSON, so I cannot use json.loads().

# from
js_obj = '{x:1, y:2, z:3}'

# to
py_obj = {'x':1, 'y':2, 'z':3}

Currently, I use regex to transform the raw data into JSON.
But this becomes painful when I encounter complicated data structures.

Is there a better solution?
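To make the failure concrete, here is a minimal sketch using only the standard library. The variable names `parsed` and `reason` are illustrative and not part of the original question.

```python
import json

js_obj = '{x:1, y:2, z:3}'

try:
    json.loads(js_obj)
    parsed = True
except json.JSONDecodeError as exc:
    # Strict JSON requires property names in double quotes,
    # so the unquoted key x is rejected.
    parsed = False
    reason = exc.msg

print(parsed)
```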

Benjamin Loison
kev
  • What is non-standard about the data you want to parse? – huu Jun 04 '14 at 01:29
  • @HuuNguyen I want to parse `Plain old javascript data structure` to python object. – kev Jun 04 '14 at 01:32
  • Oh I didn't see that `js_obj` didn't have quotes around the keys. How complicated would your data structures get? It's hard to suggest anything without knowing the cases you're trying to solve for. – huu Jun 04 '14 at 01:34
  • @HuuNguyen `js_obj` maybe nested – kev Jun 04 '14 at 01:37
  • there are similar questions on SO already: http://stackoverflow.com/a/10057449/384442 none of them is offering any ready to use solution – RomanI Jun 04 '14 at 01:49

6 Answers

59

demjson.decode()

import demjson

# from
js_obj = '{x:1, y:2, z:3}'

# to
py_obj = demjson.decode(js_obj)

chompjs.parse_js_object()

import chompjs

# from
js_obj = '{x:1, y:2, z:3}'

# to
py_obj = chompjs.parse_js_object(js_obj)

_jsonnet.evaluate_snippet()

import json, _jsonnet

# from
js_obj = '{x:1, y:2, z:3}'

# to
py_obj = json.loads(_jsonnet.evaluate_snippet('snippet', js_obj))

ast.literal_eval()

import ast

# from (note: the keys are already quoted here; literal_eval parses Python literals only)
js_obj = "{'x':1, 'y':2, 'z':3}"

# to
py_obj = ast.literal_eval(js_obj)
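Note that ast.literal_eval() parses Python literals, not JavaScript, which is why the example above uses quoted keys. The following sketch shows the difference; `try_literal_eval` is a hypothetical helper introduced for illustration only.

```python
import ast

def try_literal_eval(text):
    """Return (True, obj) if text is a Python literal, else (False, None)."""
    try:
        return True, ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Unquoted keys are parsed as variable names, which literal_eval rejects.
        return False, None

quoted = try_literal_eval("{'x':1, 'y':2, 'z':3}")
unquoted = try_literal_eval('{x:1, y:2, z:3}')
js_bool = try_literal_eval("{'x': true}")

print(quoted, unquoted, js_bool)
```

So this option is only a fit when the scraped data happens to also be valid Python literal syntax.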
Benjamin Loison
kev
12

Use json5

import json5

js_obj = '{x:1, y:2, z:3}'

py_obj = json5.loads(js_obj)

print(py_obj)

# output
# {'x': 1, 'y': 2, 'z': 3}
bikram
  • This is the best one :) – Ice Bear Feb 22 '22 at 11:34
  • **Caution**: unless you have very small objects, don't use JSON5; it's explicitly stated in their documentation that it is slow. And they are not lying, it is very, very slow even on average-size JSON. Test it on a real use case before adopting this. (I tested version 0.9.8) – cglacet Jul 25 '22 at 12:53
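Given the speed concern raised in the comment above, one possible mitigation is to try the strict standard-library parser first and only fall back to a tolerant parser when that fails. This is a sketch, not a benchmarked recommendation; `loads_with_fallback` and its `fallback` parameter are hypothetical names.

```python
import json

def loads_with_fallback(text, fallback):
    """Parse text with json.loads; call fallback(text) only if that fails.

    fallback is any callable taking a string, for example json5.loads
    (an assumption: pass whichever tolerant parser you have installed).
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return fallback(text)
```

Assuming json5 is installed, a call would look like loads_with_fallback(js_obj, json5.loads). Inputs that are already valid JSON never reach the slower parser.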
7

I faced the same problem this afternoon and finally found a good solution: JSON5.

The syntax of JSON5 is closer to native JavaScript, so it can help you parse non-standard JSON objects.

You might want to check pyjson5 out.

Lyhokia
4

This will likely not work everywhere, but as a start, here's a simple regex that should convert the keys into quoted strings so you can pass the result to json.loads. Or is this what you're already doing?

In[70] : quote_keys_regex = r'([\{\s,])(\w+)(:)'

In[71] : re.sub(quote_keys_regex, r'\1"\2"\3', js_obj)
Out[71]: '{"x":1, "y":2, "z":3}'

In[72] : js_obj_2 = '{x:1, y:2, z:{k:3,j:2}}'

In[73] : re.sub(quote_keys_regex, r'\1"\2"\3', js_obj_2)
Out[73]: '{"x":1, "y":2, "z":{"k":3,"j":2}}'
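As a sketch of where "not everywhere" applies: the same regex wrapped in a function, run on the nested example and on a string value that itself contains something shaped like a key. `quote_keys` and the `tricky` input are illustrative names and data, not part of the original answer.

```python
import json
import re

QUOTE_KEYS_REGEX = r'([\{\s,])(\w+)(:)'

def quote_keys(js_text):
    """Wrap bare keys in double quotes using the regex from the answer."""
    return re.sub(QUOTE_KEYS_REGEX, r'\1"\2"\3', js_text)

# Nested objects with simple values work.
nested = json.loads(quote_keys('{x:1, y:2, z:{k:3,j:2}}'))

# A string value containing " y:" is rewritten too, which corrupts it.
tricky = quote_keys('{a:"x, y:z"}')
try:
    json.loads(tricky)
    tricky_ok = True
except json.JSONDecodeError:
    tricky_ok = False

print(nested, tricky, tricky_ok)
```

The regex cannot tell a key from text inside a string, so it is best kept for data whose string values you know are simple.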
chrisb
3

If you have Node.js available on the system, you can ask it to evaluate the JavaScript expression for you and print the stringified result. The resulting JSON can then be fed to json.loads:

import json
from subprocess import Popen, PIPE

def evaluate_javascript(s):
    """Evaluate and stringify a JavaScript expression in Node.js, and convert the
    resulting JSON to a Python object"""
    # 'node -' reads the script to run from stdin
    node = Popen(['node', '-'], stdin=PIPE, stdout=PIPE)
    stdout, _ = node.communicate(f'console.log(JSON.stringify({s}))'.encode('utf8'))
    return json.loads(stdout.decode('utf8'))
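A slightly more defensive variant of the same idea might look like the sketch below. It assumes the node executable is on PATH; the function name `evaluate_javascript_checked` and the `timeout` parameter are hypothetical additions. Keep in mind that this executes whatever JavaScript the scraped page contains, so it is only appropriate for sources you trust.

```python
import json
import shutil
import subprocess

def evaluate_javascript_checked(expression, timeout=10):
    """Evaluate a JavaScript expression in Node.js and return it as a Python object.

    Raises RuntimeError if node is missing, subprocess.CalledProcessError if
    node exits with an error, and subprocess.TimeoutExpired on a timeout.
    """
    if shutil.which('node') is None:
        raise RuntimeError('node executable not found on PATH')
    script = f'console.log(JSON.stringify({expression}))'
    result = subprocess.run(
        ['node', '-'],        # '-' makes node read the script from stdin
        input=script,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,           # raise on a non-zero exit status
    )
    return json.loads(result.stdout)
```

Usage would be evaluate_javascript_checked('{x:1, y:2, z:3}'), which should give the same dictionary as the function above when Node.js is installed.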
Chris Billington
  • After trying other suggestions, I finally solved my problem with this solution. Thank you very much! – vaduc Aug 31 '23 at 09:53
2

Not including objects

json.loads()

  • json.loads() doesn't accept undefined; you have to change it to null
  • json.loads() only accepts double quotes
    • {"foo": 1, "bar": null}

Use this if you are sure that your JavaScript code only has double quotes around key names.

import json
import re

json_text = """{"foo": 1, "bar": undefined}"""
json_text = re.sub(r'("\s*:\s*)undefined(\s*[,}])', '\\1null\\2', json_text)

py_obj = json.loads(json_text)

ast.literal_eval()

  • ast.literal_eval() doesn't accept undefined; you have to change it to None
  • ast.literal_eval() doesn't accept null; you have to change it to None
  • ast.literal_eval() doesn't accept NaN; you have to change it to None
  • ast.literal_eval() doesn't accept true; you have to change it to True
  • ast.literal_eval() doesn't accept false; you have to change it to False
  • ast.literal_eval() accepts single and double quotes
    • {"foo": 1, "bar": None} or {'foo': 1, 'bar': None}
import ast
import re

js_obj = """{'foo': 1, 'bar': undefined}"""
js_obj = re.sub(r'([\'\"]\s*:\s*)undefined(\s*[,}])', '\\1None\\2', js_obj)
js_obj = re.sub(r'([\'\"]\s*:\s*)null(\s*[,}])', '\\1None\\2', js_obj)
js_obj = re.sub(r'([\'\"]\s*:\s*)NaN(\s*[,}])', '\\1None\\2', js_obj)
js_obj = re.sub(r'([\'\"]\s*:\s*)true(\s*[,}])', '\\1True\\2', js_obj)
js_obj = re.sub(r'([\'\"]\s*:\s*)false(\s*[,}])', '\\1False\\2', js_obj)

py_obj = ast.literal_eval(js_obj)
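The five substitutions above can be folded into a single pass. This is a sketch under the same assumption as the original code: the JavaScript literal appears directly after a quoted key and is followed by a comma or closing brace, so values inside arrays are not handled. The names `JS_TO_PY`, `JS_LITERAL_REGEX` and `js_literals_to_python` are hypothetical.

```python
import ast
import re

# JavaScript literal -> Python literal, same mapping as the re.sub calls above
JS_TO_PY = {
    'undefined': 'None',
    'null': 'None',
    'NaN': 'None',
    'true': 'True',
    'false': 'False',
}

JS_LITERAL_REGEX = re.compile(
    r'([\'"]\s*:\s*)(undefined|null|NaN|true|false)(\s*[,}])'
)

def js_literals_to_python(js_text):
    """Replace JavaScript-only literals that follow a quoted key."""
    return JS_LITERAL_REGEX.sub(
        lambda m: m.group(1) + JS_TO_PY[m.group(2)] + m.group(3),
        js_text,
    )

js_obj = "{'foo': 1, 'bar': undefined, 'ok': true, 'no': false, 'n': null}"
py_obj = ast.literal_eval(js_literals_to_python(js_obj))
print(py_obj)
```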