One generic way to determine the structure of a very large file containing a single JSON entity would be to run the following query:
jq -nc --stream -f structural-paths.jq huge.json | sort -u
where structural-paths.jq contains:
inputs
| select(length == 2)
| .[0]
| map( if type == "number" then 0 else . end )
Note that the '0's in the output signify that there is at least one valid array index at the corresponding position, not that '0' is actually a valid index at that position.
Note also that for very large files, using jq --stream to process the entire file could be quite slow.
Example:
Given {"a": {"b": [0,1, {"c":2}]}}, the result of the above incantation would be:
["a","b",0,"c"]
["a","b",0]
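To see where these paths come from, it can help to look at the raw streaming events first. The following is a minimal sketch rather than part of the recipe itself: the scratch directory and the file name sample.json are placeholders made up for the demo, and LC_ALL=C is added only to make the sort order reproducible across locales.

```shell
# Scratch directory for the demo files (hypothetical names, removed at the end).
tmp=$(mktemp -d)

# The same jq program as above, written to a file.
cat > "$tmp/structural-paths.jq" <<'EOF'
inputs
| select(length == 2)
| .[0]
| map( if type == "number" then 0 else . end )
EOF

# The small example document.
printf '%s\n' '{"a": {"b": [0,1, {"c":2}]}}' > "$tmp/sample.json"

# Raw events: [path, leaf] pairs have length 2, closing events have length 1.
events=$(jq -c --stream . "$tmp/sample.json")
printf '%s\n' "$events"

# Keep only the [path, leaf] events, normalize array indices to 0, de-duplicate.
result=$(jq -nc --stream -f "$tmp/structural-paths.jq" "$tmp/sample.json" | LC_ALL=C sort -u)
printf '%s\n' "$result"
# ["a","b",0,"c"]
# ["a","b",0]

rm -rf "$tmp"
```

The three leaf paths ["a","b",0], ["a","b",1] and ["a","b",2,"c"] collapse to two lines once the indices are normalized, which is why the output stays small even when arrays are long.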
Top-level structure
If you just want more information about the top-level structure, you could simplify the above jq program to:
inputs | select(length==1)[0][0] | if type == "number" then 0 else . end
Structure to a given depth
If the command-line sort fails, then you might want to limit the number of paths by considering them only to a certain depth.
If the depth is not too great, then hopefully your command-line sort will be able to manage; if not, then using the command-line uniq, which removes only adjacent duplicates, would at least trim the output somewhat.
A better option might be to define uniques(stream) in jq, and then use it, as illustrated here:
# Output: a stream of the distinct `tostring` values of the items in the stream,
# in order of first occurrence
def uniques(stream):
  foreach (stream|tostring) as $s ({};
    if .[$s] then .emit = false else .emit = true | .item = $s | .[$s]=true end;
    if .emit then .item else empty end );

# Output: a stream of path prefixes of at most $depth components,
# with every array index replaced by 0
def spaths($depth):
  inputs
  | select(length==1)[0][0:$depth]
  | map(if type == "number" then 0 else . end);

uniques(spaths($depth))
A suitable invocation of jq would then look like:
jq -nr --argjson depth 3 --stream -f structural-paths.jq huge.json
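As a quick check on a small input, the depth-limited program can be run against the example document from earlier. This is only a sketch: the scratch directory and sample.json are placeholder names invented for the demo, and the jq program is copied from above.

```shell
# Scratch directory for the demo files (hypothetical names, removed at the end).
tmp=$(mktemp -d)

# The depth-limited program from above, written to a file.
cat > "$tmp/structural-paths.jq" <<'EOF'
def uniques(stream):
  foreach (stream|tostring) as $s ({};
    if .[$s] then .emit = false else .emit = true | .item = $s | .[$s]=true end;
    if .emit then .item else empty end );

def spaths($depth):
  inputs
  | select(length==1)[0][0:$depth]
  | map(if type == "number" then 0 else . end);

uniques(spaths($depth))
EOF

# The small example document.
printf '%s\n' '{"a": {"b": [0,1, {"c":2}]}}' > "$tmp/sample.json"

# No external sort: uniques/1 does the de-duplication inside jq.
paths=$(jq -nr --argjson depth 3 --stream -f "$tmp/structural-paths.jq" "$tmp/sample.json")
printf '%s\n' "$paths"
# ["a","b",0]
# ["a","b"]
# ["a"]

rm -rf "$tmp"
```

The lines appear in the order in which each distinct path is first seen in the stream, not in sorted order.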
Besides avoiding the cost of sorting, using uniques/1 preserves the order in which paths first appear in the original JSON.
"JSON Pointer" pointers
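If you want to convert array path expressions to "JSON Pointer" strings (e.g. for use with jm or jstream), simply append the following to the relevant jq program at a point where the paths are still arrays (that is, before any call to uniques, which emits strings):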
| "/" + join("/")
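For instance, appending that filter to the first program in this answer and running it on the small example might look like the sketch below. The scratch directory and the file names pointers.jq and sample.json are placeholders made up for the demo, and LC_ALL=C is there only to make the sort order reproducible.

```shell
# Scratch directory for the demo files (hypothetical names, removed at the end).
tmp=$(mktemp -d)

# The first program from this answer, with the JSON Pointer conversion appended.
cat > "$tmp/pointers.jq" <<'EOF'
inputs
| select(length == 2)
| .[0]
| map( if type == "number" then 0 else . end )
| "/" + join("/")
EOF

# The small example document.
printf '%s\n' '{"a": {"b": [0,1, {"c":2}]}}' > "$tmp/sample.json"

# -r prints the pointer strings without surrounding quotes.
pointers=$(jq -nr --stream -f "$tmp/pointers.jq" "$tmp/sample.json" | LC_ALL=C sort -u)
printf '%s\n' "$pointers"
# /a/b/0
# /a/b/0/c

rm -rf "$tmp"
```

Note that this simple join does no escaping. JSON Pointer (RFC 6901) writes "~" as "~0" and "/" as "~1" inside a reference token, so keys containing either character would need an extra substitution step before joining.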