11

Given the following json:

{
    "README.rst": {
        "_status": {
            "md5": "952ee56fa6ce36c752117e79cc381df8"
        }
    },
    "docs/conf.py": {
        "_status": {
            "md5": "6e9c7d805a1d33f0719b14fe28554ab1"
        }
    }
}

is there a query language that can produce:

{
    "README.rst": "952ee56fa6ce36c752117e79cc381df8",
    "docs/conf.py": "6e9c7d805a1d33f0719b14fe28554ab1",
}

My best attempt so far with JMESPath (http://jmespath.org/) isn't very close:

>>> jmespath.search('*.*.md5[]', db)
['952ee56fa6ce36c752117e79cc381df8', '6e9c7d805a1d33f0719b14fe28554ab1']

I've gotten to the same point with ObjectPath (http://objectpath.org):

>>> t = Tree(db)
>>> list(t.execute('$..md5'))
['952ee56fa6ce36c752117e79cc381df8', '6e9c7d805a1d33f0719b14fe28554ab1']

I couldn't make any sense of JSONiq (do I really need to read a 105 page manual to do this?) This is my first time looking at json query languages..

Ghislain Fourny
  • 6,971
  • 1
  • 30
  • 37
thebjorn
  • 26,297
  • 11
  • 96
  • 138
  • so you might be searching a key other than md5? and it can be arbitrarily deep? – Joran Beasley Sep 12 '15 at 00:36
  • Does this happen to be in the context of a Django application? This can be done via rest_framework serializers but that is overkill outside of that context. – moorecm Sep 18 '15 at 16:29

6 Answers6

6

not sure why you want a query language this is pretty easy

def find_key(data,key="md5"):
    for k,v in data.items():
       if k== key: return v
       if isinstance(v,dict):
          result = find_key(v,key)
          if result:return result

dict((k,find_key(v,"md5")) for k,v in json_result.items()) 

it's even easier if the value dict always has "_status" and "md5" as keys

dict((k,v["_status"]["md5"]) for k,v in json_result.items()) 

alternatively I think you could do something like

t = Tree(db)
>>> dict(zip(t.execute("$."),t.execute('$..md5'))

although I dont know that it would match them up quite right ...

thebjorn
  • 26,297
  • 11
  • 96
  • 138
Joran Beasley
  • 110,522
  • 12
  • 160
  • 179
  • 1
    I want a query language so I can give end-users ad-hoc search capabilities in a very large .json file. – thebjorn Sep 12 '15 at 00:19
  • I think you might want to look up ad-hoc ;-) Each file path has a number of "categories" as direct children (only `_status` in this example), and each "categry" has data (potentially deeply nested). I would like to create a command line tool that gives end-users (developers) a way to query this structure (think structured grep). – thebjorn Sep 12 '15 at 00:26
  • ad hoc ˌad ˈhäk/ adjective & adverb formed, arranged, or done for a particular purpose only. – Joran Beasley Sep 12 '15 at 00:28
  • ahhh I think i now understand the pairing to md5 was only for demonstration and you want them to be able to arbirarilly decide what key to search for ? I still think the best solution is for you to write a python script ... (which is definitely an ad-hoc solution) ... – Joran Beasley Sep 12 '15 at 00:30
  • I was using it in the "not pre-planned" meaning (http://dictionary.cambridge.org/dictionary/english/ad-hoc) – thebjorn Sep 12 '15 at 00:31
  • Any script I write would be similar to defining a query language, so instead of re-inventing the wheel I figured I'd use something existing and well-tested. For xml one would probably use xquery, even if it is perfectly possible to hand-parse xml too. – thebjorn Sep 12 '15 at 00:33
  • I see i was using it in the general CS term that means a solution designed for a particular purpose (ie pairing level0 json keys with deeper nested keys) but tbh if they are developers and dont know how to search through a json dict on their own they are not much of a developer – Joran Beasley Sep 12 '15 at 00:33
  • Well.. take a look at the output for `radon cc -s -j `, it's rather volumnious. Being able to pipe it through a declarative query makes things easier. – thebjorn Sep 12 '15 at 00:36
  • Its a pretty simple algorithm above that should work ... if you take the arguments from argv ... I think that this is so domain specific you will not find a prebuild solution ... if it was a hard algorithm i can see your concern ... but its not ... your biggest concern is if it nests dictionaries more than 1000 layers deep you may encounter a recursion overflow error – Joran Beasley Sep 12 '15 at 00:45
  • I;ll upvote the question in the hopes that I am wrong and there is totally a way to do this with some other third party library ... – Joran Beasley Sep 12 '15 at 00:57
4

Here is the JSONiq code that does the job:

{|
    for $key in keys($document)
    return {
        $key: $document.$key._status.md5
    }
|}

You can execute it here with the Zorba engine.

If the 105-page manual you mention is the specification, I do not recommend reading it as a JSONiq user. I would rather advise reading tutorials or books online, which give a more gentle introduction.

Ghislain Fourny
  • 6,971
  • 1
  • 30
  • 37
  • Thanks! Can you suggest any tutorials that goes a bit in-depth? – thebjorn Oct 12 '15 at 17:01
  • Hi thebjorn, Sure! I revived an old tutorial [here](https://github.com/ghislainfourny/jsoniq-tutorial) and updated it to the latest syntax. Also, there are a couple of online talks on youtube by Jonathan Robie and Chris Hillery. – Ghislain Fourny Oct 14 '15 at 09:18
3

Do in ObjectPath:

l = op.execute("[keys($.*), $..md5]")

you'll get:

[
  [
    "README.rst",
    "docs/conf.py"
  ],
  [
    "952ee56fa6ce36c752117e79cc381df8",
    "6e9c7d805a1d33f0719b14fe28554ab1"
  ]
]

then in Python:

dict(zip(l[0],l[1]))

to get:

{
    'README.rst': '952ee56fa6ce36c752117e79cc381df8', 
    'docs/conf.py': '6e9c7d805a1d33f0719b14fe28554ab1'
}

Hope that helps. :)

PS. I'm using OPs' keys() to show how to make full query that works anywhere in the document not only when keys are in the root of document.

PS2. I might add new function so that it would look like: object([keys($.*), $..md5]). Shoot me tweet http://twitter.com/adriankal if you want that.

2

A solution that implements a new query language:

def keylist(db):
    "Return all the keys in db."

    def _keylist(db, prefix, res):
        if prefix is None:
            prefix = []

        for key, val in db.items():
            if isinstance(val, dict):
                _keylist(val, prefix + [key], res)
            else:
                res.append(prefix + [key])

    res = []
    _keylist(db, [], res)
    return ['::'.join(key) for key in res]

def get_key(db, key):
    "Get path and value from key."

    def _get_key(db, key, path):
        k = key[0]
        if len(key) == 1:
            return path + [k, db[k]]
        return _get_key(db[k], key[1:], path + [k])

    return _get_key(db, key, [])

def search(query, db):
    "Convert query to regex and use it to search key space."
    keys = keylist(db)
    query = query.replace('*', r'(?:.*?)')
    matching = [key for key in keys if re.match(query, key)]
    res = [get_key(db, key.split('::')) for key in matching]
    return dict(('::'.join(r[:-1]), r[-1]) for r in res)

which gives me something that's pretty close to the requirements:

>>> pprint.pprint(search("*::md5", db))
{'README.rst::_status::md5': '952ee56fa6ce36c752117e79cc381df8',
 'docs/conf.py::_status::md5': '6e9c7d805a1d33f0719b14fe28554ab1'}

and a query language that looks like a glob/re hybrid (if we're making a new language, at least make it look familiar):

>>> pprint.pprint(search("docs*::md5", db))
{'docs/conf.py::_status::md5': '6e9c7d805a1d33f0719b14fe28554ab1'}

since the data contains file paths I've randomly used :: as a path separator. (I'm pretty sure it doesn't handle the full json grammar yet, but that should be mostly grunt work).

thebjorn
  • 26,297
  • 11
  • 96
  • 138
  • i think its kinda funny that you didnt want to implement a new query language(at least partially because you were worried about bugs...) but then you use a much more complex algorithm than what i was suggesting(well somewhat more complex) ... that said if it solves your issue I am glad :) +1 for an answer that solves your problem – Joran Beasley Sep 12 '15 at 02:43
  • Conceptually it's simple, it just transforms the problem into a form where the re module can be used to solve the search problem. Serializing the key-space and retrieving a tree-path are both relatively easy to write test for, so I'm relatively comfortable with the code quality issues. It is a very limited query language, not at all on the same level as ObjectPath or JMESPath, and as a _solution_ I think it is unacceptable (why would anyone want to learn my very limited query language?) I'll set a bounty on it as soon as I'm able to see if there are any solutions using standard tools.. – thebjorn Sep 12 '15 at 16:32
  • a better question might be why would a __developer__ want to learn a more complicated query language like JMES when about 4 lines of relativly straightforward python would suffice .... unless when you say its for developers you dont mean software developer .. – Joran Beasley Sep 12 '15 at 17:08
  • I'm guessing because declarative is better than imperative. It's why we learn xpath and xquery, it's why `document.querySelector()` was added, not to mention sql. Describing _what_ you want instead of _how_ to get it turns out to be very useful. And 1 (short) line of a query language is more than a 80% savings in developer effort -- and that's just for a very simple case. For a more complex case, coming back to `document.querySelectorAll("h2 ~ h2 + [href^=https] > img")` after six months is much friendlier than the equivalent recursive tree-traversal code. – thebjorn Sep 12 '15 at 18:23
2

Missed the python requirement, but if you are willing to call external program, this will still work. Please note, that jq >= 1.5 is required for this to work.

# If single "key" $p[0] has multiple md5 keys, this will reduce the array to one key.
cat /tmp/test.json | \
jq-1.5 '[paths(has("md5")?) as $p | { ($p[0]): getpath($p)["md5"]}] | add '

# this will not create single object, but you'll see all key, md5 combinations
cat /tmp/test.json | \
jq-1.5 '[paths(has("md5")?) as $p | { ($p[0]): getpath($p)["md5"]}] '

Get paths with "md5"-key '?'=ignore errors (like testing scalar for key). From resulting paths ($p) filter and surround result with '{}' = object. And then those are in an array ([] surrounding the whole expression) which is then "added/merged" together |add

https://stedolan.github.io/jq/

Manwe
  • 401
  • 2
  • 11
  • This looks interesting, but I don't think your query is correct... this seems like it will find all leaf keys (not just md5) and if there are two leaf keys it will pick the value of the last one... – thebjorn Sep 19 '15 at 13:24
  • This is the closest I've gotten (also wrong..) `[[paths(..) | select(index("md5"))] as $p | {(($p[])[0]): getpath($p[])}] | add`. – thebjorn Sep 19 '15 at 14:38
  • @thebjorn You are right. First code had a "brain freeze and the select did nothing. I've now edited and tested with more complex test file. I'm not really a 'jq' expert, but it's been a really usefull tool. – Manwe Sep 21 '15 at 12:51
  • New code removed the leaf requirement from search key. test with: `jq-1.5 'paths(has("_status")?) as $p | { ($p[0]): getpath($p)["_status"]} '` – Manwe Sep 21 '15 at 12:56
  • Excellent! (and also the only answer that answered the question asked ) – thebjorn Sep 21 '15 at 19:06
1

If your json is well structured, ie. assured you'll have _status and md5 sub-elements, you could just load the json up and use a list comprehension to spit out the items you're looking for.

>>> import json
>>> my_json = json.loads(json_string)
>>> print [(key, value['_status']['md5']) for key, value in my_json.iteritems()]
[(u'README.rst', u'952ee56fa6ce36c752117e79cc381df8'), (u'docs/conf.py', u'6e9c7d805a1d33f0719b14fe28554ab1')]
Christian Witts
  • 11,375
  • 1
  • 33
  • 46