I am reading a DynamoDB table from Spark. This table has a JSON string in one field and plain strings in the other fields. This is not a duplicate of *Query JSON column using DataFrames*: that question explains how to extract columns from a JSON string, but not nested JSON columns.
val users = sqlContext.read.dynamodb("Dynamodb_table")
users.show(1)
Sample data set:

|col1                                               |ID |field2|field3|
|---------------------------------------------------|---|------|------|
|{"a":[{"b":"value1"},{"b":"value2"}],"c":"valC"...}|A1 |X1    |Y1    |
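For reproducibility, here is roughly how the sample row can be built locally (a sketch only; the real source is a DynamoDB table read via `sqlContext.read.dynamodb`, and the trailing `...` in `col1` stands for fields I have elided):

```scala
// Hypothetical local reproduction of the sample row, so the problem
// can be demonstrated without a DynamoDB connection.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("nested-json-repro")
  .getOrCreate()
import spark.implicits._

// col1 holds a JSON string; the other columns are plain strings.
val users = Seq(
  ("""{"a":[{"b":"value1"},{"b":"value2"}],"c":"valC"}""", "A1", "X1", "Y1")
).toDF("col1", "ID", "field2", "field3")

users.show(1, false)
```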
I need to extract a few fields from col1 (the JSON structure) and the ID field. I was able to figure out how to parse the JSON field (col1) as explained here, but I am not able to extract the nested fields.
My code:
val users = sqlContext.read.dynamodb("Dynamodb_table")
val data = users.selectExpr("get_json_object(col1, '$.c')","get_json_object(col1, '$.a')","ID")
data.show(1,false)
|a                                 |c   |ID|
|----------------------------------|----|--|
|[{"b":"value1"},{"b":"value2"}...]|valC|A1|
Now, when I apply the same get_json_object to the above DataFrame, I get all null values.
val nestedData = data.selectExpr("get_json_object(a, '$.b')","c","ID")
nestedData.show(false)
|get_json_object(a, '$.b')|c   |ID |
|-------------------------|----|---|
|null                     |valC|A1 |
I tried explode as well, since column 'a' holds an array of structs. But that didn't work either: the DataFrame 'data' returns column 'a' as a string instead of an array. Any ideas how to solve this?
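This is roughly the explode attempt (a sketch of what I tried, not working code): since `get_json_object` returns 'a' as a plain string rather than an `ArrayType`, `explode` cannot be applied to it.

```scala
// What I attempted: explode the array in 'a', then pull out 'b'.
// This fails because get_json_object returned 'a' as a JSON string,
// not as an array column, so explode(col("a")) throws an
// AnalysisException at analysis time.
import org.apache.spark.sql.functions.{col, explode}

val nested = data
  .withColumn("a_elem", explode(col("a")))   // AnalysisException: 'a' is StringType
  .select(col("a_elem.b").as("b"), col("c"), col("ID"))
nested.show(false)
```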
My expected result: a flattened structure from the dataset:
|b     |c   |ID |
|------|----|---|
|value1|valC|A1 |
|value2|valC|A1 |