Note: this is NOT a duplicate of the following (or of several other similar discussions):
- Spark SQL JSON dataset query nested datastructures
- How to use Spark SQL to parse the JSON array of objects
- Querying Spark SQL DataFrame with complex types
I have a Hive table that I must read and process purely via a Spark SQL query. The table has a string-type column that contains JSON dumps from APIs, so, as expected, it holds deeply nested stringified JSON.
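For concreteness, the table is shaped roughly like this (the table and non-JSON column names are placeholders I've made up for this question; var6 is the JSON column I reference in the snippets below):

-- hypothetical shape of the source table (I have no control over it)
CREATE TABLE bills_raw (
  bill_id STRING,
  var6    STRING   -- raw JSON dump from the API, like the example below
);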
Let's take this example (it depicts the exact depth / complexity of the data I'm trying to process):
{
  "key1": ..
  "key2": ..
  ..
  "bill_summary": {
    "key1": ..
    "key2": ..
    ..
    "items": [
      {
        "item": {
          "key1": ..
          "key2": ..
          ..
          "type": "item_type_1"
          ..
          "total_cost": 57.65
          ..
        }
      },
      {
        "item": {
          "key1": ..
          "key2": ..
          ..
          "total_cost": 23.31
          ..
        }
      }
      ..
      {
        "item": {
          "key1": ..
          "key2": ..
          ..
          "type": "item_type_1"
          ..
        }
      }
    ]
    ..
  }
  ..
}
I'm interested in the items array, and I'm able to access it via

get_json_object(get_json_object(var6, '$.bill_summary'), '$.items') AS items
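In context, that projection sits in a query shaped roughly like this (the table name is the placeholder from above; my real query is a much longer CTE chain):

SELECT
  get_json_object(get_json_object(var6, '$.bill_summary'), '$.items') AS items
FROM bills_raw   -- placeholder name for the Hive table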
Now here's the problem:

- I need to pull out all (type, total_cost) tuples from the array.
- But I only need the entries where both fields are present, whereas several item objects have only one of them, or neither.
- I have managed to separately pick all the type fields and all the total_cost fields into two arrays, but because of the absent fields mentioned above I lose the relationship between them. What I end up with (using the snippet below) are two arrays, possibly of different lengths, with no certainty that corresponding elements of each array belong to the same item.
(This snippet shows only part of my rather long SQL query, which employs CTEs.)
..
split(get_json_object(get_json_object(var6, '$.bill_summary'), '$.items[*].item.type'), ',') AS types_array,
split(get_json_object(get_json_object(var6, '$.bill_summary'), '$.items[*].item.total_cost'), ',') AS total_cost_array
..
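To make the mismatch concrete: for the sample JSON above (where the second item has no type and the third has no total_cost), the two arrays would conceptually come out as

types_array      -> [item_type_1, item_type_1]
total_cost_array -> [57.65, 23.31]

(ignoring the leftover brackets/quotes that splitting the JSON-array string leaves behind), and nothing tells me whether total_cost_array[1] belongs to the same item as types_array[1]; here it does not.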
Now here are the limitations:

- I have no control over the source Hive table's schema or its data.
- I want to do this purely via a Spark SQL query.
- I cannot use DataFrame manipulation.
- I do not want to employ a registered UDF (I'm keeping that as a last resort).
I've spent several hours on docs and forums, but the Spark SQL docs are sparse and the discussions mostly revolve around the DataFrame API, which I cannot use. Is this problem even solvable with a SQL query alone?
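For what it's worth, the kind of pure-SQL shape I'm hoping exists is something like the sketch below. It is untested and makes assumptions: Spark 2.4+ (so that from_json accepts a DDL-formatted array<struct<...>> schema in SQL), LATERAL VIEW explode, and the placeholder table name from above.

WITH parsed AS (
  SELECT
    -- parse the stringified items array into a real array of structs;
    -- fields absent from an item simply come back as NULL
    from_json(
      get_json_object(get_json_object(var6, '$.bill_summary'), '$.items'),
      'array<struct<item: struct<type: string, total_cost: double>>>'
    ) AS items
  FROM bills_raw   -- placeholder table name
)
SELECT
  entry.item.type       AS type,
  entry.item.total_cost AS total_cost
FROM parsed
LATERAL VIEW explode(items) exploded AS entry
WHERE entry.item.type IS NOT NULL
  AND entry.item.total_cost IS NOT NULL

If from_json cannot be used this way on my Spark version, is there some other construct that keeps each type paired with the total_cost of the same item?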