
I am reading a DynamoDB table from Spark. This table has a JSON string in one field and plain strings in the other fields. This is not a duplicate of "query Json Column using dataframes": that question explains how to extract columns from a JSON string, but not nested JSON columns.

val users = sqlContext.read.dynamodb("Dynamodb_table")

users.show(1)

Sample data set:

 |col1                                               | ID | field2|field3|
 -------------------------------------------------------------------------
 |{"a":[{"b":"value1"},{"b":"value2"}],"c":"valC"...}|A1  | X1    |Y1    |

I need to extract a few fields from col1 (the JSON structure) and the ID field. I can parse the JSON field (col1) as explained here, but I am not able to extract the nested fields.

My code:

val users = sqlContext.read.dynamodb("Dynamodb_table")
val data = users.selectExpr("get_json_object(col1, '$.c')","get_json_object(col1, '$.a')","ID")

data.show(1,false)
|a                                 |c   |ID|
--------------------------------------------
|[{"b":"value1"},{"b":"value2"}...]|valC|A1|

Now when I try to apply the same get_json_object on the above data frame, I get all null values.

val nestedData = data.selectExpr("get_json_object(a, '$.b')","c","ID")
nestedData.show(false)

|get_json_object(a, '$.b')| c  | ID|
------------------------------------
|null                     |valC|A1 |    

I tried explode as well, since col 'a' has an array of structs. But that didn't work either, because the data frame 'data' returns col 'a' as a string instead of an array. Any ideas how to solve this?
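For what it's worth, here is how I understand the null: the value under 'a' is a JSON array, and a path like '$.b' looks up a key on an object, so it has nothing to match on an array. A quick check of that reading in plain Python (stdlib `json` only, with the quotes fixed so the sample parses; this is illustrative data, not Spark itself):

```python
import json

# The string that get_json_object(col1, '$.a') returns for the sample row
# (quotes fixed so it parses; illustrative only, not the real table data).
a = '[{"b":"value1"},{"b":"value2"}]'

parsed = json.loads(a)
print(type(parsed).__name__)     # prints "list" -- it's an array, not an object
print(parsed[0]["b"])            # prints "value1" -- 'b' lives on the elements, not on 'a' itself
```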

My expected result: a flattened structure from the dataset:

|b     |c   | ID|
-----------------
|value1|valC|A1 |
|value2|valC|A1 |
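Outside Spark, the flattening I am after is equivalent to this plain-Python sketch (stdlib `json` only; the row dict is made up to mirror the sample table, with quotes fixed so col1 parses):

```python
import json

# One sample row, mirroring the table above (quotes fixed so col1 parses).
row = {
    "col1": '{"a":[{"b":"value1"},{"b":"value2"}],"c":"valC"}',
    "ID": "A1",
}

col1 = json.loads(row["col1"])
# "Explode" the array under "a": one output row per element, repeating c and ID.
flattened = [(elem["b"], col1["c"], row["ID"]) for elem in col1["a"]]
print(flattened)  # [('value1', 'valC', 'A1'), ('value2', 'valC', 'A1')]
```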
dheee