
So I have the following JSON response that I get from a cloud database (this is a small part of it):

{'series': [{'name': 'G', 'type': 'c8y_g_ist', 'unit': ''},
            {'name': 'T', 'type': 'c8y_t_rohr_ist', 'unit': 'C'},
            {'name': 'P', 'type': 'c8y_ph_soll_ist', 'unit': 'ph'},
            {'name': 'T', 'type': 'c8y_t_tank_ist', 'unit': 'C'},
            {'name': 'V', 'type': 'c8y_v_rohr_ist', 'unit': 'm/s'},
            {'name': 'Bio', 'type': 'c8y_NO3_Wert', 'unit': 'mg/l'},
            {'name': 'T', 'type': 'c8y_t_rohrsurface_ist', 'unit': 'C'},
            {'name': 'Bio', 'type': 'c8y_PO4_Wert', 'unit': 'mg/l'},
            {'name': 'P', 'type': 'c8y_ph_ist', 'unit': 'ph'},
            {'name': 'Bio', 'type': 'OD_Wert', 'unit': ''}],
 'truncated': False,
 'values': {'2018-03-15T00:00:17.000Z': [{'max': 92.78, 'min': 92.78},
                                         {'max': 3.21, 'min': 3.21}],
            '2018-03-15T00:05:18.000Z': [None, {'max': 3.2, 'min': 3.2}],
            '2018-03-15T00:06:49.000Z': [{'max': 92.78, 'min': 92.78},
                                         {'max': 3.2, 'min': 3.2},
                                         {'max': 5, 'min': 5},
                                         {'max': 3.64, 'min': 3.64},
                                         {'max': 0, 'min': 0},
                                         {'max': 0, 'min': 0},
                                         {'max': 3.04, 'min': 3.04},
                                         {'max': 0, 'min': 0},
                                         {'max': 0, 'min': 0},
                                         {'max': 0, 'min': 0}],
            '2018-03-15T00:10:17.000Z': [{'max': 95.22, 'min': 95.22},
                                         {'max': 3.14, 'min': 3.14},
                                         None,
                                         {'max': 3.57, 'min': 3.57},
                                         {'max': 0.01, 'min': 0.01},
                                         None,
                                         {'max': 2.97, 'min': 2.97},
                                         None,
                                         None,
                                         None],
            '2018-03-15T00:15:17.000Z': [{'max': 92.78, 'min': 92.78},
                                         {'max': 3.13, 'min': 3.13},
                                         None,
                                         None,
                                         {'max': 0, 'min': 0},
                                         None,
                                         None,
                                         None,
                                         None,
                                         None],

Now I can read a value, for example the 92.78 that is in this line: 'values': {'2018-03-15T00:00:17.000Z': [{'max': 92.78, 'min': 92.78},

With:

import json
import pprint

SingleValue = get_mint_json['values']['2018-03-15T00:00:17.000Z'][0]['max']
SingleValue = json.dumps(SingleValue)
pprint.pprint(SingleValue)

(Here get_mint_json is the variable the parsed JSON response is stored in.) The output is as expected: '92.78'

Now my question: I want to do this for all the data points, so not only for 2018-03-15T00:00:17.000Z but also for 2018-03-15T00:05:18.000Z, 2018-03-15T00:06:49.000Z, and so on. How do I do that without hardcoding every one of those timestamp keys?

JorDik
  • Are you looking for how to do this in a loop for all date values, as in `values = [datedvalue[0]['max'] for datedvalue in get_mint_json['values'].values()]`? If so (and if you can't immediately understand how/why that listcomp works), there's a duplicate question with a nice answer I can dig up. – abarnert Mar 23 '18 at 17:27
  • Also, you may want to look at something like [`dpath`](https://pypi.python.org/pypi/dpath) or the Python port of ObjC KVC paths that I can't remember the name of or some similar library, which will let you write this as something like `search(get_mint_json, 'values/*/0/max')` – abarnert Mar 23 '18 at 17:31
  • Yess, looping like that is exactly what I'm looking for! Would be great if you could help dig up that duplicate question. Thanks for the dpath tip, i'll take a look at that. – JorDik Mar 23 '18 at 18:03
  • Still looking for the canonical dup, but meanwhile, try these: [1](https://stackoverflow.com/questions/16548917/decoding-nested-json-with-multiple-for-loops), [2](https://stackoverflow.com/questions/47564015/looping-over-nested-json-elements), [3](https://stackoverflow.com/questions/46529929/are-nested-loops-required-when-processing-json-response). – abarnert Mar 23 '18 at 18:17
  • Neat! I used `dpath.util.search(get_mint_json, 'values/*/0/max')`. And that gave `{ "values": { "2018-03-15T00:00:17.000Z": [ { "max": 92.78 } ], "2018-03-15T00:06:49.000Z": [ { "max": 92.78 } ],` Only need to convert that to a CSV format but that shouldn't be too hard. Thanks a lot for your help!! – JorDik Mar 23 '18 at 18:21

1 Answer


What you need to do here is loop over every member of get_mint_json['values'] to get get_mint_json['values']['2018-03-15T00:00:17.000Z'], get_mint_json['values']['2018-03-15T00:05:18.000Z'], etc., and then, for each one, access its [0]['max'].

Once you recognize that's your problem, it's pretty easy to write it up as a for loop:

for date, datedvalue in get_mint_json['values'].items():
    finalvalue = datedvalue[0]['max']
    do_something(finalvalue)

Or, since you don't need the key-value pairs here, just the values:

for datedvalue in get_mint_json['values'].values():
    finalvalue = datedvalue[0]['max']
    do_something(finalvalue)

However, notice that in your case, sometimes datedvalue[0] may not be a dict with a max key; it may be None. You'll probably want to handle that:

for datedvalue in get_mint_json['values'].values():
    if datedvalue[0]:
        finalvalue = datedvalue[0]['max']
        do_something(finalvalue)

You can also collapse this into a list comprehension or generator expression:

finalvalues = (datedvalue[0]['max'] 
               for datedvalue in get_mint_json['values'].values()
               if datedvalue[0])
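
With the parentheses that's a lazy generator expression, so nothing is computed until you iterate over it. If you'd rather have a plain list up front, wrap it in list() or just use square brackets to make it a list comprehension:

# Same filter and lookup as above, but eagerly built into a list.
finalvalues = [datedvalue[0]['max']
               for datedvalue in get_mint_json['values'].values()
               if datedvalue[0]]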

The same thing works if you need to loop over a list instead of a dict's keys/values/key-value pairs, of course. And it works with multiple levels of nesting, just by writing one nested for loop for each level. For example, if you wanted every datedvalue[i]['max'] for each index i, not just datedvalue[0]['max']:

for datedvalue in get_mint_json['values'].values():
    for individualvalue in datedvalue:
        if individualvalue:
            finalvalue = individualvalue['max']
            do_something(finalvalue)

You can write that as a comprehension too, but at this point it's probably getting way too complex for it to be readable that way.
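
Just to illustrate the point, the doubly-nested version as a generator expression would look roughly like this, which is why the explicit loops above are usually the better choice:

# Equivalent to the nested for loops above, skipping the None entries.
finalvalues = (individualvalue['max']
               for datedvalue in get_mint_json['values'].values()
               for individualvalue in datedvalue
               if individualvalue)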


Another way to solve this problem is to use a library that offers "query paths", equivalent to XPath for XML (as used by ElementTree) or KVC paths for Objective-C. One example is the dpath library, which lets you write XPath-style slash-separated paths with * wildcards:

import dpath.util

finalvalues = dpath.util.search(get_mint_json, 'values/*/0/max')

Or, for the doubly-nested version:

finalvalues = dpath.util.search(get_mint_json, 'values/*/*/max')
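
Note that search() returns a filtered copy of the nested structure rather than a flat list of numbers. If you just want the values themselves, a minimal sketch, assuming the yielded keyword that the dpath versions I've used support (check your version's docs):

import dpath.util

# yielded=True makes search() generate (path, value) pairs instead of
# rebuilding the nested dict, so collecting just the numbers is easy.
maxvalues = [value for path, value in
             dpath.util.search(get_mint_json, 'values/*/0/max', yielded=True)]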
abarnert
  • Thanks for this! Helped me understand a lot about how this works! I've tried both ways of doing it and it worked. There's only 1 problem, every row of the output is completely randomized. Any idea how I could fix that? – JorDik Mar 29 '18 at 14:48
  • @JorDik Are you talking about the fact that the elements of each dict are in arbitrary order? That’s a consequence of the fact that JSON explicitly defines objects as being unordered—whatever order you happen to see in the file is arbitrary, and could be different the next time you export the same data. So most Python libraries store them in a dict, which is also unordered. If you care about the order in the file, you have to tell your JSON library to use an OrderedDict and respect the order in the file even though it’s meaningless. The stdlib module docs explain how to do it with that module. – abarnert Mar 29 '18 at 16:02
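
For completeness, here's a minimal sketch of what that last comment describes, assuming the raw response body is in a variable named response_text (a placeholder; use whatever your HTTP client gives you):

import json
from collections import OrderedDict

# object_pairs_hook tells the stdlib json module to build every JSON object
# as an OrderedDict, so keys keep the order they appear in the response.
get_mint_json = json.loads(response_text, object_pairs_hook=OrderedDict)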