
Note: this is NOT a duplicate of the following (or of several other similar discussions)


I have a Hive table that I must read and process purely via a Spark SQL query. This table has a string-type column that contains JSON dumps from APIs; expectedly, it holds deeply nested stringified JSONs.

Let's take this example (it depicts the exact depth / complexity of the data that I'm trying to process):

{
    "key1": ..
    "key2": ..
    ..
    "bill_summary": {
        "key1": ..
        "key2": ..
        ..
        "items": [
            {
                "item": {
                    "key1": ..
                    "key2": ..
                    ..
                    "type": "item_type_1"
                    ..
                    "total_cost": 57.65
                    ..
                }
            },
            {
                "item": {
                    "key1": ..
                    "key2": ..
                    ..
                    "total_cost": 23.31
                    ..
                }
            }
            ..
            {
                "item": {
                    "key1": ..
                    "key2": ..
                    ..
                    "type": "item_type_1"
                    ..
                }
            }
        ]
        ..
    }
    ..
}

I'm interested in the items array. I'm able to access it via

get_json_object(get_json_object(var6, '$.bill_summary'), '$.items') AS items
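
In context, this expression sits inside a full query roughly like the following (a minimal sketch; var6 is the JSON string column, and my_table is just a placeholder for the actual Hive table):

SELECT
  get_json_object(get_json_object(var6, '$.bill_summary'), '$.items') AS items
FROM my_table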

Now here's the problem

  • I need to extract all (type, total_cost) tuples from the array
  • But I only need the entries where both fields are present, whereas several item objects have only one of them or neither

  • I've also managed to separately pick all type fields and all total_cost fields into two separate arrays, but because of the absent fields mentioned above, I end up losing the relationship between them.
  • What I get in the end (using the following snippet) are two arrays (possibly of different lengths), with no certainty that corresponding elements of each array belong to the same item

This snippet shows only part of my rather long SQL query; it employs a CTE:

..
split(get_json_object(get_json_object(var6, '$.bill_summary'), '$.items[*].item.type'), ',') AS types_array,
split(get_json_object(get_json_object(var6, '$.bill_summary'), '$.items[*].item.total_cost'), ',') AS total_cost_array
..
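
Conceptually (ignoring the exact quoting/bracket handling that split produces), for the sample bill above these two arrays come out roughly as:

types_array      = [item_type_1, item_type_1]   -- from the 1st and 3rd items
total_cost_array = [57.65, 23.31]               -- from the 1st and 2nd items

so pairing elements by position would wrongly associate the second type with a total_cost that belongs to a different item.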

Now here are the limitations

  • I have no control over source Hive table schema or its data
  • I want to do this using purely a Spark SQL query
  • I cannot use DataFrame manipulation
  • I do not want to employ a registered UDF (I'm keeping that as a last resort)

I've spent several hours on docs and forums, but the Spark SQL docs are sparse and discussions mostly revolve around the DataFrame API, which I cannot use. Is this problem even solvable by a SQL query alone?

1 Answer


After hours of scouring the web, this answer gave me the hint that I can cast a stringified JSON array to an array of structs in Spark SQL. Here's what I finally did:

     ..
     var6_items AS
  (SELECT hash_id,
          entity1,
          dt,
          get_json_object(get_json_object(var6,'$.bill_summary'), '$.items[*].item') AS items_as_string
   FROM rows_with_appversion
   WHERE appversion >= 14),

     filtered_var6_items AS
  (SELECT *
   FROM var6_items
   WHERE items_as_string IS NOT NULL
     AND items_as_string != '')

    SELECT from_json(items_as_string, 'array<struct<type:string,total_cost:string>>') AS items_as_struct_array
    FROM filtered_var6_items
     ..
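
As a side note, the from_json cast can be sanity-checked in isolation on a literal string (a minimal sketch, not part of the original query):

SELECT from_json(
         '[{"type":"item_type_1","total_cost":57.65},{"total_cost":23.31}]',
         'array<struct<type:string,total_cost:string>>'
       ) AS items_as_struct_array

The second struct comes back with type = NULL, since that key is absent in the corresponding JSON object.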

Explanation

  • the expression get_json_object(get_json_object(var6,'$.bill_summary'), '$.items[*].item') AS items_as_string results in items_as_string containing the following (stringified) JSON (note that one level of redundant nesting around each item has also been removed):
[
  {
    "key1": "val1",
    "key2": "val2",
    "type": "item_type_1",
    "total_cost": 57.65
  },
  {
    "key1": "val1",
    "key2": "val2",
    "total_cost": 57.65
  }
  ..
  {
    "key1": "val1",
    "key2": "val2",
    "type": "item_type_1"
  }
]
  • thereafter, the from_json function allows casting the above structure into an array of structs. Once that is obtained, I can keep only the structs that have both type and total_cost not NULL (one way to do this is sketched below)
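
For instance, here's a minimal sketch of that last filtering step, assuming Spark 2.4+ where the filter higher-order function with a lambda is available (items_with_both_fields is just an illustrative alias; filtered_var6_items and items_as_string come from the query above):

SELECT filter(
         from_json(items_as_string, 'array<struct<type:string,total_cost:string>>'),
         x -> x.type IS NOT NULL AND x.total_cost IS NOT NULL
       ) AS items_with_both_fields
FROM filtered_var6_items

Only the array elements that carry both keys survive the filter, so the (type, total_cost) relationship is preserved per item.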


  • thank you for sharing. one question: is there a reason why it's a `get_json_object()` inside of another `get_json_object()`? – 123 Mar 11 '22 at 21:27