I have some huge JSON files I need to profile so I can transform them into some tables. I found jq to be really useful in inspecting them, but there are going to be hundreds of these, and I'm pretty new to jq.

I already have some really handy functions in my ~/.jq (big thank you to @mikehwang)

def profile_object:
    def parse_entry: {"key": .key, "value": (.value | type)};
    to_entries | map(parse_entry) | sort_by(.key) | from_entries;

def profile_array_objects:
    map(profile_object) | map(to_entries) | reduce .[] as $item ([]; . + $item) | sort_by(.key) | from_entries;

I'm sure I'll have to modify them after I describe my question.
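
As a quick illustration of where these helpers stop, here is a sketch on a made-up sample (`profile_top_level` is only a hypothetical wrapper name). `profile_object` reports nested values as just `array` or `object`, which is the part that needs to become recursive:

```shell
# Hypothetical wrapper; the jq def mirrors profile_object from above.
profile_top_level() {
  jq -c '
    def profile_object:
      to_entries
      | map({key: .key, value: (.value | type)})
      | sort_by(.key)
      | from_entries;
    profile_object'
}

# Made-up sample, far smaller than the real files
printf '%s' '{"name":"XYZ Company","reporting":[{"group_id":"660"}],"market":{"name":"Austin, TX"}}' \
  | profile_top_level
# prints {"market":"object","name":"string","reporting":"array"}
```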

I'd like a jq line to profile a single object. If a key maps to an array of objects then collect the unique keys across the objects and keep profiling down if there are nested arrays of objects there. If a value is an object, profile that object.

Sorry for the long example, but imagine several GBs of this:

{
    "name": "XYZ Company",
    "type": "Contractors",
    "reporting": [
        {
            "group_id": "660",
            "groups": [
                {
                    "ids": [
                        987654321,
                        987654321,
                        987654321
                    ],   
                    "market": {
                        "name": "Austin, TX",
                        "value": "873275"
                    }
                },
                {
                    "ids": [
                        987654321,
                        987654321,
                        987654321
                    ],   
                    "market": {
                        "name": "Nashville, TN",
                        "value": "2393287"
                    }
                }
            ]
        }
    ],
    "product_agreements": [
        {
            "negotiation_arrangement": "FFVII",
            "code": "84144",
            "type": "DJ",
            "type_version": "V10",
            "description": "DJ in a mask",
            "name": "Claptone",
            "negotiated_rates": [
                {
                    "company_references": [
                        1,
                        5,
                        458
                    ],
                    "negotiated_prices": [
                        {
                            "type": "negotiated",
                            "rate": 17.73,
                            "expiration_date": "9999-12-31",
                            "code": [
                                "11"
                            ],
                            "billing_modifier_code": [
                                "124"
                            ],
                            "billing_class": "professional"
                        }
                    ]
                },
                {
                    "company_references": [
                        747
                    ],
                    "negotiated_prices": [
                        {
                            "type": "fee",
                            "rate": 28.42,
                            "expiration_date": "9999-12-31",
                            "code": [
                                "11"
                            ],
                            "billing_class": "professional"
                        }
                    ]
                }
            ]
        },
        {
            "negotiation_arrangement": "MGS3",
            "name": "David Byrne",
            "type": "Producer",
            "type_version": "V10",
            "code": "654321",
            "description": "Frontman from Talking Heads",
            "negotiated_rates": [
                {
                    "company_references": [
                        1,
                        9,
                        2344,
                        8456
                    ],
                    "negotiated_prices": [
                        {
                            "type": "negotiated",
                            "rate": 68.73,
                            "expiration_date": "9999-12-31",
                            "code": [
                                "11"
                            ],
                            "billing_class": "professional"
                        }
                    ]
                },
                {
                    "company_references": [
                        679
                    ],
                    "negotiated_prices": [
                        {
                            "type": "fee",
                            "rate": 89.25,
                            "expiration_date": "9999-12-31",
                            "code": [
                                "11"
                            ],
                            "billing_class": "professional"
                        }
                    ]
                }
            ]
        }
    ],
    "version": "1.3.1",
    "last_updated_on": "2023-02-01"
}

Desired output:

{
    "name": "string",
    "type": "string",
    "reporting": [
      {
        "group_id": "number",
        "groups": [
            {
                "ids": [
                    "number"
                ],
                "market": {
                    "name": "string",
                    "value": "string"
                }
            }
        ]
      }
    ],
    "product_agreements": [
      {
        "negotiation_arrangement": "string",
        "code": "string",
        "type": "string",
        "type_version": "string",
        "description": "string",
        "name": "string",
        "negotiated_rates": [
          {
            "company_references": [
                "number"
            ],
            "negotiated_prices": [
              {
                "type": "string",
                "rate": "number",
                "expiration_date": "string",
                "code": [
                  "string"
                ],
                "billing_modifier_code": [
                  "string"
                ],
                "billing_class": "string"
              }
            ]
          }
        ]        
      }
    ],
    "version": "string",
    "last_updated_on": "string"
}

Really sorry if there are any errors in that, but I tried to make it all consistent and about as simple as I could.

To restate the need: recursively profile each key in a JSON object whenever its value is an object or an array. The solution needs to be independent of key names. Happy to clarify further if needed.

Trey Brooks
  • Note that even though the value at `.reporting[].group_id` consists of digits only, it is still a string because it is wrapped in quotes (unlike "real" numbers, for instance at `.reporting[].groups[].ids[]`). – pmf Feb 19 '23 at 16:37
  • For JSON documents that are too big to fit in RAM, one way to obtain a kind of "profile" is using jq's streaming parser (--stream), as shown in some of the entries at https://stackoverflow.com/questions/41491773/browsing-large-json-file/ – peak Feb 23 '23 at 02:32
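
A minimal sketch of the streaming idea mentioned in the comment above, under some assumptions: `stream_profile` is a hypothetical helper name, array indices are collapsed to `[]`, and path components are joined with `.` (which would be ambiguous if keys themselves contain dots). It produces a flat map from path to observed leaf types rather than the nested profile asked for, but it only keeps one entry per distinct path in memory:

```shell
# Hypothetical helper: summarise leaf types per path using the streaming parser.
stream_profile() {
  jq -cn --stream '
    # Leaf events have the form [path, value]; closing events have length 1.
    reduce (inputs | select(length == 2)) as [$path, $leaf] ({};
      # Replace array indices with "[]" so all elements share one path
      ($path | map(if type == "number" then "[]" else . end) | join(".")) as $p
      | .[$p] = ((.[$p] // []) + [$leaf | type] | unique)
    )' "$@"
}

# Made-up sample read from stdin
printf '%s' '{"a":[{"b":1},{"b":"x","c":null}],"n":"v"}' | stream_profile
# prints {"a.[].b":["number","string"],"a.[].c":["null"],"n":["string"]}
```

For a real file the call would be `stream_profile input.json`. Empty arrays and objects show up as leaves of type `array` or `object`.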

3 Answers

The jq module schema.jq at https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed was designed to produce the kind of structural schema you describe.

For very large inputs, it might be very slow, so if the JSON is sufficiently regular, it might be possible to use a hybrid strategy - profiling enough of the data to come up with a comprehensive structural schema, and then checking that it does apply.

For conformance testing of structural schemas such as produced by schema.jq, see https://github.com/pkoppstein/JESS

peak

Given your input.json, here is a solution:

jq '
def schema:
    if   type == "object" then .[] |= schema
    elif type == "array"  then map(schema)|unique
         | if (first | type) == "object" then [add] else . end
    else type
    end;
schema
' input.json
Philippe
  • Using `[first|schema]` for arrays trivializes the general problem, though it might be useful in practice, especially if `map(schema) | unique` is a singleton for all arrays. In fact, for a potentially less misleading solution, it would be worth considering replacing `[first|schema]` by `map(schema)|unique` – peak Feb 19 '23 at 23:00
  • @peak Indeed, updated. – Philippe Feb 20 '23 at 00:28
  • Wow, yeah this worked! Thank you. @peak you're right, and your modification does well. However, could it merge inner sibling objects together such that keys that are missing from one, would be present in the schema? – Trey Brooks Feb 20 '23 at 00:29
  • Oh, and this dies when I run it on a 14GB file. It just says 'killed' lol – Trey Brooks Feb 20 '23 at 00:35
  • My initial solution produces exactly your desired output, for the input you gave. I have not tested on big files, or different schemas for elements of the same array. – Philippe Feb 20 '23 at 00:37
  • @TreyBrooks Updated to merge immediate inner sibling objects for arrays. – Philippe Feb 20 '23 at 00:58
  • I appreciate it, but it's not solving my issue. I've updated my example to try to mimic what's going on. Notice that `billing_modifier_code` key. Obviously things like that probably aren't going to be in the first element. – Trey Brooks Feb 20 '23 at 01:37
  • I'm not sure what you meant by `Notice that billing_modifier_code key`, as my code generates `"billing_modifier_code": [ "string" ]`, as you were expecting. – Philippe Feb 20 '23 at 10:01
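
To make the merging behaviour discussed in these comments concrete, here is a small sketch on made-up inputs (`schema_of` is a hypothetical wrapper around the `schema` def from this answer). Because `add` merges objects only one level deep, sibling keys are combined, but when two siblings carry different nested schemas under the same key, only one of them appears to survive:

```shell
# Hypothetical wrapper around the schema def from this answer
schema_of() {
  jq -c '
    def schema:
      if   type == "object" then .[] |= schema
      elif type == "array"  then map(schema) | unique
           | if (first | type) == "object" then [add] else . end
      else type
      end;
    schema'
}

# Sibling objects with different keys are merged into one profile
printf '%s' '{"items":[{"a":1},{"b":"x"},{"a":2}]}' | schema_of
# prints {"items":[{"a":"number","b":"string"}]}

# Differing nested profiles under the same key are not merged: "x" is dropped
printf '%s' '[{"p":[{"x":1}]},{"p":[{"y":2}]}]' | schema_of
# prints [{"p":[{"y":"number"}]}]
```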

Here's a variant of @Philippe's solution: it coalesces the objects in map(schema) for arrays in a principled though lossy way. (All these half-solutions trade precision for speed.)

Note that keys_unsorted is used below; if using gojq, then either this would have to be changed to keys, or a def of keys_unsorted provided.
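
A minimal sketch of such a fallback def, assuming the implementation in use has `keys` but no `keys_unsorted`. Key order should not affect the merged result, since each key is combined independently. Shown here with jq itself, where the def simply shadows the builtin:

```shell
jq -nc '
  # Fallback for implementations without keys_unsorted (for example gojq)
  def keys_unsorted: keys;
  {"b": 1, "a": 2} | keys_unsorted'
# prints ["a","b"]
```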

# Use "JSON" as the union of two distinct types
# except combine([]; [ $x ]) => [ $x ]
def combine($a;$b):
  if $a == $b then $a elif $a == null then $b elif $b == null then $a
  elif ($a == []) and ($b|type) == "array" then $b
  elif ($b == []) and ($a|type) == "array" then $a
  else "JSON"
  end;

# Profile an array by calling mergeTypes(.[] | schema)
# in order to coalesce objects
def mergeTypes(s):
    reduce s as $t (null;
       if ($t|type) != "object" then .types = (.types + [$t] | unique)
       else .object as $o
       | .object = reduce ($t | keys_unsorted[]) as $k ($o;
                    .[$k] = combine( $t[$k]; $o[$k] ) 
          )
       end)
       | (if .object then [.object] else null end ) + .types ;

def schema:
    if   type == "object" then .[] |= schema
    elif type == "array"
    then if . == [] then [] else mergeTypes(.[] | schema) end
    else type
    end;
schema

Example: Input:

{"a": [{"b":[1]}, {"c":[2]}, {"c": []}] }

Output:

{
  "a": [
    {
      "b": [
        "number"
      ],
      "c": [
        "number"
      ]
    }
  ]
}
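
A second sketch, on a made-up input, of what "lossy" means here: when the same key has different types across sibling objects, `combine` falls back to "JSON", and non-object array elements are listed after the merged object. `merged_schema` is a hypothetical wrapper name; the defs are the ones above, repeated so the snippet runs on its own:

```shell
# Hypothetical wrapper; the jq defs are copied from this answer.
merged_schema() {
  jq -c '
    def combine($a; $b):
      if $a == $b then $a elif $a == null then $b elif $b == null then $a
      elif ($a == []) and ($b|type) == "array" then $b
      elif ($b == []) and ($a|type) == "array" then $a
      else "JSON"
      end;

    def mergeTypes(s):
      reduce s as $t (null;
        if ($t|type) != "object" then .types = (.types + [$t] | unique)
        else .object as $o
        | .object = reduce ($t | keys_unsorted[]) as $k ($o;
                      .[$k] = combine($t[$k]; $o[$k]))
        end)
      | (if .object then [.object] else null end) + .types;

    def schema:
      if   type == "object" then .[] |= schema
      elif type == "array"
      then if . == [] then [] else mergeTypes(.[] | schema) end
      else type
      end;
    schema'
}

# "b" is a number in one sibling and a string in another; 7 is a bare number
printf '%s' '{"a":[{"b":1},{"b":"x"},7]}' | merged_schema
# prints {"a":[{"b":"JSON"},"number"]}
```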
peak