Run the same jq pipeline on two files and compare the results?

Question

If I run comm -23 <(jq -r '.["@graph"][] | .["rdfs:label"]' 9.0/schemaorg-all-http.jsonld | sort) <(jq -r '.["@graph"][] | .["rdfs:label"]' 13.0/schemaorg-all-http.jsonld | sort) in the data/release directory of the schema.org repo, it works, but it's hideous. Is it possible to collapse it into a single jq command?

score 1 · Accepted Answer · edited Oct 12 '21 at 17:33

Can't say it's less hideous, but yes, it is possible to do this entirely in jq.

jq -nr '[inputs | [.["@graph"][]["rdfs:label"]]] | .[0]-.[1] | .[]' {9.0,13.0}/schemaorg-all-http.jsonld

Explanation:

Given two files, say x:

[
{"x": 1},
{"x": 2}
]

and y:

[
{"x": 3},
{"x": 4}
]

jq -n 'inputs' x y produces

[
  {
    "x": 1
  },
  {
    "x": 2
  }
]
[
  {
    "x": 3
  },
  {
    "x": 4
  }
]

Observe that this is just two separate results, one after the other, not a single array. We will eventually want to address them by index, so we wrap them in an array.

jq -n '[inputs]' x y

[
  [
    {
      "x": 1
    },
    {
      "x": 2
    }
  ],
  [
    {
      "x": 3
    },
    {
      "x": 4
    }
  ]
]

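The reason for jq -n is explained in "Why does `inputs` skip the first line of the input file?". In short, without -n jq reads the first input itself and hands it to the filter as ., so inputs only yields whatever comes after it.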
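To see the effect of -n directly, here is a small sketch using the x and y files from above. It assumes jq is on the PATH; the temporary directory and the variable names are only for illustration.

```shell
# Recreate the two sample files from the explanation in a scratch directory.
tmp=$(mktemp -d)
printf '[{"x":1},{"x":2}]\n' > "$tmp/x"
printf '[{"x":3},{"x":4}]\n' > "$tmp/y"

# Without -n, jq consumes the first file as `.` before the filter runs,
# so `inputs` only sees the second file.
without_n=$(jq -c '[inputs]' "$tmp/x" "$tmp/y")

# With -n, nothing is read up front, so `inputs` sees both files.
with_n=$(jq -nc '[inputs]' "$tmp/x" "$tmp/y")

echo "$without_n"   # [[{"x":3},{"x":4}]]
echo "$with_n"      # [[{"x":1},{"x":2}],[{"x":3},{"x":4}]]
```

The -c flag is only there to print each result on one line.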

Once this is done and understood, the rest is relatively easy.

.["@graph"] extracts @graph which is an array of objects,
[] iterates said array
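["rdfs:label"] extracts the value of the "rdfs:label" key from each object. Note there is no dot before ["rdfs:label"], which looks strange at first: once a path has started with a dot, further [...] lookups can be chained onto it directly, so this is equivalent to piping into .["rdfs:label"].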
As we saw, inputs just emits the contents of the input files, one after the other; recall that this is a stream of results, not an array. The manual says of the pipe operator that "if the one on the left produces multiple results, the one on the right will be run for each of those results", so inputs | .["@graph"][]["rdfs:label"] applies this extraction to each input.
However, the end result is again just a stream of separate results for each file, and we want to work on them as a unit, so we need to collect them into an array. The manual says about the array construction operator [] that "You can use it to construct an array to "collect" all the results of a filter into an array (as in [.items[].name])", which is exactly what we are doing. We do not have something as nice as .name, though: because of the colon in the key we have to use the more verbose and more confusing ["rdfs:label"] syntax. We now have inputs | [.["@graph"][]["rdfs:label"]], which outputs one array per input file. Each array holds the rdfs:label values and has one entry per object in @graph.
What should be familiar by now is that we take this output and wrap it in an array so that it can be addressed later, and so we arrive at [inputs | [.["@graph"][]["rdfs:label"]]]. Running each step and observing the output is helpful. You will get a two-dimensional array: the outer dimension is the number of files, and each inner array, as in the previous point, has one entry per object in @graph.
The dot in jq means "the whole input", which at this point is the array of arrays. So .[0]-.[1] takes its first and second elements and computes the difference: the labels from the first file with every label that also occurs in the second file removed.
Finally, instead of outputting this array, we want to output one thing after the other which will allow -r to strip the quotes. The .[] means to iterate over the difference array and output each element. It's the opposite operation to the [] array constructor.
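The steps above can be tried on a pair of tiny stand-in files. This is only a sketch: old.jsonld, new.jsonld and the labels in them are made-up sample data, not taken from the schema.org releases, and jq is assumed to be on the PATH.

```shell
tmp=$(mktemp -d)
# Hypothetical miniature versions of the two release files.
cat > "$tmp/old.jsonld" <<'EOF'
{"@graph": [{"rdfs:label": "Thing"}, {"rdfs:label": "Person"}, {"rdfs:label": "Legacy"}]}
EOF
cat > "$tmp/new.jsonld" <<'EOF'
{"@graph": [{"rdfs:label": "Thing"}, {"rdfs:label": "Person"}, {"rdfs:label": "Recipe"}]}
EOF

# The chained lookup and the explicit pipe give the same labels.
chained=$(jq -c '[.["@graph"][]["rdfs:label"]]' "$tmp/old.jsonld")
piped=$(jq -c '[.["@graph"][] | .["rdfs:label"]]' "$tmp/old.jsonld")

# One array of labels per input file (two separate results).
per_file=$(jq -nc 'inputs | [.["@graph"][]["rdfs:label"]]' \
  "$tmp/old.jsonld" "$tmp/new.jsonld")

# The same results collected into one outer array, addressable by index.
wrapped=$(jq -nc '[inputs | [.["@graph"][]["rdfs:label"]]]' \
  "$tmp/old.jsonld" "$tmp/new.jsonld")

# Labels in the first file that are absent from the second, one per line.
only_in_old=$(jq -nr '[inputs | [.["@graph"][]["rdfs:label"]]] | .[0]-.[1] | .[]' \
  "$tmp/old.jsonld" "$tmp/new.jsonld")

echo "$per_file"
echo "$wrapped"       # [["Thing","Person","Legacy"],["Thing","Person","Recipe"]]
echo "$only_in_old"   # Legacy
```

One difference from the comm version worth keeping in mind: the jq output is not sorted, it keeps the order in which the labels appear in the first file.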
