I'm looking for an efficient way to search through a large JSON object for "sub-objects" that match a filter (via select(), I imagine). However, the top-level JSON is an object with arbitrary nesting, containing simple values, objects, and arrays of objects. For example:

{
  "name": "foo",
  "class": "system",
  "description": "top-level-thing",
  "configuration": {
    "status": "normal",
    "uuid": "id"
  },
  "children": [
    {
      "id": "c1",
      "class": "c1",
      "children": [
        {
          "id": "c1.1",
          "class": "c1.1"
        },
        {
          "id": "c1.1",
          "class": "FINDME"
        }
      ]
    },
    {
      "id": "c2",
      "class": "FINDME"
    }
  ],
  "thing": {
    "id": "c3",
    "class": "FINDME"
  }
}    

I have a solution which does part of what I want (and is understandable):

jq -r '.. | arrays | .[] | select(.class=="FINDME"?) | .id'

which returns:

c2
c1.1

... however, it misses c3, and it changes the order of the items in the output. Additionally, I expect this to operate on potentially very large JSON structures, so I would like to make sure I find an efficient solution. Bonus points for something that remains readable by jq neophytes (myself included).

FWIW, references I was using to help me on the way, in case they help others:

crimson-egret
2 Answers

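For small to modest-sized JSON input, you're on the right track with .., but "arrays | .[]" only examines array elements, which is why c3 (an object stored directly under the "thing" key) is missed. It seems you want to select objects, like so: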

.. | objects | select(.class=="FINDME"?) | .id

For very large JSON documents, this might require too much memory, because jq's regular parser reads the whole document into memory before the filter runs. In that case it may be worth knowing about jq's streaming parser. Unfortunately it's much more difficult to use, so I'd suggest trying the above first and, if you're interested, reading the section of the jq manual that covers the --stream option.
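As a quick check, the objects-based filter above can be run against a trimmed-down copy of the question's sample. This is only a sketch: the variable names `json` and `out` are made up for illustration, and the `?` is dropped because `.class` does not raise an error once the input has been narrowed to objects.

```shell
# Trimmed-down copy of the question's sample (hypothetical variable name).
json='{"class":"system","children":[{"id":"c1","class":"c1","children":[{"id":"c1.1","class":"FINDME"}]},{"id":"c2","class":"FINDME"}],"thing":{"id":"c3","class":"FINDME"}}'

# `..` visits every value depth-first; `objects` keeps only the objects,
# so a match is found whether it sits in an array or directly under a key.
out=$(printf '%s' "$json" | jq -r '.. | objects | select(.class == "FINDME") | .id')

printf '%s\n' "$out"
```

With this input the ids come out in document order: c1.1, c2, c3.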

peak

Here's a streaming-parser solution. To make sense of it, you'll need to read up on the --stream option, but the key is that the output includes lines of the form: [PATH, VALUE]
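To get a feel for those events, here is a minimal sketch that streams a tiny document; the variable names `json` and `events` are placeholders chosen for this example. Besides the [PATH, VALUE] pairs, the parser also emits one-element [PATH] events when an object or array closes, which is why the program below tests `length != 2`.

```shell
# A one-object fragment of the question's sample (hypothetical variable name).
json='{"thing":{"id":"c3","class":"FINDME"}}'

# --stream turns the document into a sequence of events instead of one value;
# -c prints each event on its own line.
events=$(printf '%s' "$json" | jq -c --stream '.')

printf '%s\n' "$events"
# [["thing","id"],"c3"]
# [["thing","class"],"FINDME"]
# [["thing","class"]]
# [["thing"]]
```

The last two lines are the closing events for the inner object and for the top-level object.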

program.jq

# State: {} until an "id" is seen; {id, class} once a FINDME class follows it.
foreach inputs as $in ({};
  if has("id") and has("class") then {}   # previous match was emitted: reset
  else . as $x
  | $in
  | if length != 2 then {}                # closing event: drop partial state
    elif .[0][-1] == "id" then ($x + {id: .[-1]})
    elif .[0][-1] == "class"
         and .[-1] == "FINDME" then ($x + {class: .[-1]})
    else $x
    end
  end;
  select(has("id") and has("class")) | .id )

Invocation

jq -n --stream -f program.jq input.json

Output with sample input

"c1.1"
"c2"
"c3"
peak
  • While less readable than the other answer you gave, it does what I want, including retaining the order, and I'll learn something from its use. Thanks. – crimson-egret Dec 19 '17 at 18:01
  • Please note the update to remove the assumption. How about posting some details about your file size and comparative timings? – peak Dec 19 '17 at 18:13
  • Thanks for that assumption removing update. Since the output will likely be slightly different than my example, that bit is helpful. As for timing, I don't have real data sets yet, so I can't provide that. I might generate some simulated data sets, and if I do, I'll post a comparison then. – crimson-egret Dec 19 '17 at 23:01