I'm dealing with an API that unfortunately is returning malformed (or "weirdly formed," rather -- thanks @fjarri) JSON, but on the positive side I think it may be an opportunity for me to learn something about recursion as well as JSON. It's for an app I use to log my workouts; I'm trying to make a backup script.

I can receive the JSON fine, but even after requests.get(api_url).json() (or json.loads(requests.get(api_url).text)), one of the values is still a JSON-encoded string. Luckily, I can just json.loads() the string and it properly decodes to a dict. The specific key is predictable: timezone_id, whereas its value varies (because data has been logged in multiple timezones). For example, after decoding, it might be dumped to file as "timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}", or loaded into Python as 'timezone_id': '{"name":"America/Denver","seconds":"-21600"}'
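To make the symptom concrete, here is a minimal sketch of how a double-encoded value can arise and why one decode is not enough. The `inner` and `payload` names and the server-side behaviour are assumptions for illustration; this is not the real API response.

```python
import json

# Hypothetical server-side behaviour: the inner dict is serialized to a
# string first, and that string is then embedded in the outer object.
inner = {"name": "America/Denver", "seconds": "-21600"}
payload = json.dumps({"timestamp": 1353193745000, "timezone_id": json.dumps(inner)})

decoded = json.loads(payload)
print(type(decoded["timezone_id"]))  # still a str after one decode

# A second decode of just that value yields the dict.
timezone = json.loads(decoded["timezone_id"])
print(timezone["name"])
```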

The problem is that I'm using this API to retrieve a fair amount of data, which has several layers of dicts and lists, and the double-encoded timezone_id values occur at multiple levels.

Here's my work so far with some example data, but it seems like I'm pretty far off base.

#! /usr/bin/env python3

import json
from pprint import pprint

my_input = r"""{
    "hasMore": false,
    "checkins": [
        {
            "timestamp": 1353193745000,
            "timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}",
            "privacy_groups": [
                "private"
            ],
            "meta": {
                "client_version": "3.0",
                "uuid": "fake_UUID"
            },
            "client_id": "fake_client_id",
            "workout_name": "Workout (Nov 17, 2012)",
            "fitness_workout_json": {
                "exercise_logs": [
                    {
                        "timestamp": 1353195716000,
                        "type": "exercise_log",
                        "timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}",
                        "workout_log_uuid": "fake_UUID"
                    },
                    {
                        "timestamp": 1353195340000,
                        "type": "exercise_log",
                        "timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}",
                        "workout_log_uuid": "fake_UUID"
                    }
                ]
            },
            "workout_uuid": ""
        },
        {
            "timestamp": 1354485615000,
            "user_id": "fake_ID",
            "timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}",
            "privacy_groups": [
                "private"
            ],
            "meta": {
                "uuid": "fake_UUID"
            },
            "created": 1372023457376,
            "workout_name": "Workout (Dec 02, 2012)",
            "fitness_workout_json": {
                "exercise_logs": [
                    {
                        "timestamp": 1354485615000,
                        "timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}",
                        "workout_log_uuid": "fake_UUID"
                    },
                    {
                        "timestamp": 1354485584000,
                        "timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}",
                        "workout_log_uuid": "fake_UUID"
                    }
                ]
            },
            "workout_uuid": ""
        }]}"""

def recurse(obj):
    if isinstance(obj, list):
        for item in obj:
            return recurse(item)
    if isinstance(obj, dict):
        for k, v in obj.items():
            if isinstance(v, str):
                try:
                    v = json.loads(v)
                except ValueError:
                    pass
                obj.update({k: v})
            elif isinstance(v, (dict, list)):
                return recurse(v)

pprint(json.loads(my_input, object_hook=recurse))

Any suggestions for a good way to json.loads() all those double-encoded values without changing the rest of the object? Many thanks in advance!

This post seems to be a good reference: Modifying Deeply-Nested Structures

Edit: This was flagged as a possible duplicate of this question -- I think it's fairly different, as I've already demonstrated that using json.loads() was not working. The solution ended up requiring an object_hook, which I've never had to use when decoding JSON and which is not addressed in the prior question.

n8henrie
  • First your JSON is not *malformed*, it's *weirdly formed* (otherwise you wouldn't be able to load it). Second, I don't see any problems with your solution except that if you know that `timezone_id` is the culprit, why are you trying to `json.load()` every single string value? – fjarri Sep 04 '15 at 02:07
  • possible duplicate of [How to decode JSON with Python](http://stackoverflow.com/questions/2331943/how-to-decode-json-with-python) –  Sep 04 '15 at 02:20
  • @fjarri: So would one consider JSON malformed IFF it throws an exception? Is there a better way to describe JSON whose output is not the intended output -- or even the intended type? – n8henrie Sep 04 '15 at 14:24
  • JSON is malformed if it is not formed according to the standard, which is equivalent to the parser throwing an error (assuming the parser implements the standard correctly). JSON's "intended output" is whatever was stored in it, that is `json.loads(json.dumps(data_structure)) == data_structure`. – fjarri Sep 04 '15 at 14:32
  • Fair enough -- when I wrote the Q, I paused at that point and figured someone would educate me on the semantics (thanks, btw). In this case, it seems the format of the `data_structure` that's the problem, not the json, so describing it as "malformed json" is incorrect. – n8henrie Sep 04 '15 at 14:48

2 Answers


So, the object_hook in the json loader is going to be called each time the json loader is finished constructing a dictionary. That is, the first thing it is called on is the inner-most dictionary, working outwards.

The dictionary that the object_hook callback is given is replaced by what that function returns.

So, you don't need to recurse yourself. The loader is giving you access to the inner-most things first by its nature.
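A small sketch can show that call order. The `tracing_hook` function and the `doc` string are made up for this illustration; the hook only records which dictionary it was handed and returns it unchanged.

```python
import json

calls = []

def tracing_hook(obj):
    # Record the keys of each dict as the decoder finishes building it.
    calls.append(sorted(obj))
    return obj

doc = '{"outer": {"middle": {"inner": 1}}, "sibling": {"x": 2}}'
json.loads(doc, object_hook=tracing_hook)

for keys in calls:
    print(keys)
```

The nested dictionaries are reported before the dictionaries that contain them, and the top-level object comes last.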

I think this will work for you:

def hook(obj):
    value = obj.get("timezone_id")
    # this is python 3 specific; I would check isinstance against 
    # basestring in python 2
    if value and isinstance(value, str):
        obj["timezone_id"] = json.loads(value, object_hook=hook)
    return obj
data = json.loads(my_input, object_hook=hook)

It seems to have the effect I think you're looking for when I test it.

I probably wouldn't try to decode every string value. I would call json.loads() only where you expect a double-encoded JSON object. If you try to decode every string, you might accidentally decode something that is supposed to stay a string (like "12345", when the API intends that value to be a string).
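Here is a rough sketch of that pitfall. The `decode_every_string` hook is deliberately over-eager, and the `sample` document is invented for the example (the "user_id" value in particular is hypothetical).

```python
import json

def decode_every_string(obj):
    # Deliberately over-eager: tries json.loads() on every string value.
    for key, value in list(obj.items()):
        if isinstance(value, str):
            try:
                obj[key] = json.loads(value)
            except ValueError:
                pass  # not valid JSON on its own; keep the string
    return obj

sample = '{"client_version": "3.0", "user_id": "12345", "workout_uuid": ""}'
result = json.loads(sample, object_hook=decode_every_string)
print(result)
```

The version string becomes a float and the ID becomes an int, while the empty string survives only because it is not valid JSON by itself.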

Also, your existing function is more complicated than it needs to be; it might work as-is if you always returned obj (whether or not you update its contents).

Matt Anderson
  • Geez, I had initially thought that's how `object_hook` worked and tried something similar without luck. Thanks! I don't think the `json.loads` inside `hook()` needs the `object_hook`, does it? I suppose it *could* if the `timezone_id` had a triple-encoded value. – n8henrie Sep 04 '15 at 02:19
  • @n8henrie No, it probably isn't necessary for the inner `loads` to use an `object_hook`. Shouldn't hurt, and could matter for a more complex, n-level-nested structure. – Matt Anderson Sep 04 '15 at 02:22
  • Gah, should have known. Your code was working with my example above but still not working with the full API dataset. Ends up it was a python2 / python3 error (your code comments clued me in). Decided to give Atom a shot, and for some reason it wasn't updating with the executable I had specified in its runner. A few `import sys; print(sys.version)`s and everything's set. I wonder if this is the reason I wasn't getting the `object_hook` to work correctly earlier (`type == unicode` instead of `str`?). Anyway, thanks again. – n8henrie Sep 04 '15 at 02:36
  • @n8henrie You're welcome. Yeah, I find it frustrating that on Python 2 the json loader always returns unicode and that's not configurable (even if you're writing and reading the data yourself, and you know it's all ascii). On Python 2 I usually test against `basestring` if I'm type checking for strings to catch both `str` and `unicode` and I don't really care which. – Matt Anderson Sep 04 '15 at 02:41

Your main issue is that your object_hook function should not be recursing. json.loads() takes care of the recursion itself and calls your function every time it finishes building a dictionary (i.e., obj will always be a dictionary). So instead you just want to modify the problematic keys and return the dict -- this should do what you are looking for:

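def flatten_hook(obj):
    # Python 3 version; on Python 2, use obj.iteritems() and basestring.
    # list() snapshots the items so the dict can be updated while looping.
    for key, value in list(obj.items()):
        if isinstance(value, str):
            try:
                obj[key] = json.loads(value, object_hook=flatten_hook)
            except ValueError:
                pass  # not JSON; leave the string unchanged
    return obj

pprint(json.loads(my_input, object_hook=flatten_hook))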

However, if you know the problematic (double-encoded) entry always takes a specific form (e.g. key == 'timezone_id'), it is probably safer to call json.loads() on those keys only, as Matt Anderson suggests in his answer.
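One way to combine the two approaches is a hook that is restricted to a known set of keys. This is only a sketch: `make_hook` is a hypothetical helper, not something from either answer or from the json module, and the `sample` document is a trimmed-down stand-in for the question's data.

```python
import json

def make_hook(keys):
    """Return an object_hook that decodes string values only for `keys`."""
    keys = frozenset(keys)

    def hook(obj):
        for key in keys:
            value = obj.get(key)
            if isinstance(value, str):
                try:
                    obj[key] = json.loads(value, object_hook=hook)
                except ValueError:
                    pass  # not valid JSON after all; keep the original string
        return obj

    return hook

sample = r'''{"client_version": "3.0",
 "timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}"}'''
data = json.loads(sample, object_hook=make_hook(["timezone_id"]))
print(data)
```

Only timezone_id is decoded a second time; "3.0" and the inner "-21600" stay strings because their keys are not in the set.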

lemonhead