I have a JSON structure like this:

{
    "a": "1",
    "b": "2",
    "c": {
        "d": "3"
    }
}

I want to keep only the first level of the JSON, i.e. drop any top-level entry whose value is not a string. My program looks like this:

import json

s = ''' {
    "a": "1",
    "b": "2",
    "c": {
        "d": "3"
    } } '''

data = json.loads(s) 
ret = {}

for k, v in data.items():
    if (isinstance(v, basestring)):
        ret[k] = v

print json.dumps(ret)

Since I need to process a huge number of JSON strings like this, I am looking for a faster or more elegant way to do the same thing in Python.

Ryan
  • be careful when you use json string verbatim inside a Python string literal. Use raw-string literal `r''` to avoid interpolating backslashes inside json. – jfs May 05 '14 at 17:26
  • if the question is about performance then you should provide a basic benchmark and determine how fast is fast enough in your case. – jfs May 05 '14 at 17:30
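
The raw-string advice above can be illustrated with a small sketch. It assumes Python 3 and a made-up value containing `\n`; the variable names are only for illustration. In a plain literal Python converts `\n` into a real newline, which the `json` module rejects inside a string by default, while the raw literal passes the backslash through for the JSON parser to interpret.

```python
import json

plain = '{"a": "line1\nline2"}'   # Python turns \n into a real newline character
raw = r'{"a": "line1\nline2"}'    # the JSON parser sees the two characters \ and n

try:
    json.loads(plain)
    plain_ok = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    plain_ok = False

value = json.loads(raw)["a"]      # the JSON escape is decoded by the parser
print(plain_ok, repr(value))
```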

1 Answer

Use a dict comprehension:

ret = {k: v for k, v in json.loads(s).iteritems() if isinstance(v, basestring)}

In Python 2, dict.iteritems() yields the pairs lazily, so no intermediate list of all items is built first.
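
For reference, a Python 3 version of the same filter might look like the sketch below. It assumes Python 3, where `basestring` and `iteritems()` no longer exist: `str` covers all text and `items()` already returns a view rather than a list. The helper name `keep_top_level_strings` is made up for this example.

```python
import json

def keep_top_level_strings(json_text):
    """Return only the top-level entries whose values are strings."""
    data = json.loads(json_text)
    return {k: v for k, v in data.items() if isinstance(v, str)}

s = '{"a": "1", "b": "2", "c": {"d": "3"}}'
print(json.dumps(keep_top_level_strings(s)))  # prints {"a": "1", "b": "2"}
```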

If your JSON input is truly huge, consider switching to an iterative JSON parser like ijson, and parse your JSON with an event-driven interface:

import ijson

ret = {}

# some_large_jsonfile is the path to your input file
with open(some_large_jsonfile) as json_file:
    for prefix, event, value in ijson.parse(json_file):
        # a top-level value has its key as prefix, with no '.' in it
        if prefix and '.' not in prefix and event == 'string':
            # only top-level string values
            ret[prefix] = value

but it could be a good idea to process the key-value pairs right there and then rather than build up a full dictionary.
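
The "process the pairs right away" idea can also be sketched with the standard library alone, as a generator that hands out each pair as soon as it is found. This is only a sketch under assumptions: Python 3, input with one JSON object per line (the question does not state the layout), and a hypothetical helper name `iter_top_level_strings`.

```python
import io
import json

def iter_top_level_strings(lines):
    """Yield (key, value) for every top-level string value, one JSON object per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        for key, value in json.loads(line).items():
            if isinstance(value, str):
                yield key, value

# io.StringIO stands in for a real file object here
sample = io.StringIO('{"a": "1", "c": {"d": "3"}}\n{"b": "2", "e": 5}\n')
for key, value in iter_top_level_strings(sample):
    print(key, value)
```

Because nothing is accumulated, memory use stays at roughly one parsed line at a time, whatever the number of lines.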

Martijn Pieters
  • My JSON is not huge, but I have many lines of JSON to process. – Ryan May 05 '14 at 15:43
  • @Ryan: a dict comprehension might be slower than an explicit loop (here's an [example where a generator expression (related concept) is slower than an explicit for-loop](http://stackoverflow.com/a/23318776/4279)). If an individual JSON object is small, it is not clear which would be faster: a loop that uses `.iteritems()` or one that uses `.items()` (all items have to be created anyway, same logic as `xrange()` vs `range()`). Without a benchmark it is hard to say. `unicode` could be used instead of `basestring`. Do you mean that you have many small JSON objects (one per line), e.g. like a tweet stream? – jfs May 05 '14 at 17:33
  • @J.F.Sebastian, my json is around 4K in size (average), anyway, I will do the benchmark first. Thanks. – Ryan May 08 '14 at 09:15
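
A basic benchmark along the lines suggested in the comments could look like the sketch below. It assumes Python 3 and the tiny sample object from the question rather than real 4K inputs, and the function names are made up for illustration. Both variants include the `json.loads` call, which may well dominate the timings, so the numbers only mean something when measured on representative data.

```python
import json
import timeit

s = json.dumps({"a": "1", "b": "2", "c": {"d": "3"}})

def with_loop(text):
    """Explicit for-loop variant."""
    ret = {}
    for k, v in json.loads(text).items():
        if isinstance(v, str):
            ret[k] = v
    return ret

def with_comprehension(text):
    """Dict comprehension variant."""
    return {k: v for k, v in json.loads(text).items() if isinstance(v, str)}

# Both variants must agree before their timings mean anything.
assert with_loop(s) == with_comprehension(s)

for func in (with_loop, with_comprehension):
    seconds = timeit.timeit(lambda: func(s), number=10000)
    print(func.__name__, seconds)
```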