I am reading a dynamodb table from Spark, this table has one JSON string in one field and strings in other fields. I am able to read the JSON fields but not the nested JSON fields. This is not a DUPLICATE of query Json Column using dataframes. The question does explain how to extract columns from JSON string but not the Nested JSON columns.
import com.github.traviscrawford.spark.dynamodb._
val users = sqlContext.read.dynamodb("Dynamodb_table")
users.show(1)
Sample Data set
|col1 | ID | field2|field3|
-------------------------------------------------------------------------------------
|{"a":[{"b":"value1","x":23},{"b":value2,"x":52}],"c":"valC"}|A1 | X1 |Y1 |
I need to extract few fields from col1(JSON structure) and ID field. I am able to figure out how to parse the JSON field(col1) and get field 'c' from col1 as explained here but not able to extract the nested fields.
My code:
val users = sqlContext.read.dynamodb("Dynamodb_table")
val data = users.selectExpr("get_json_object(col1, '$.c')","get_json_object(col1, '$.a')","ID")
data.show(1,false)
|a |c |ID|
---------------------------------------------------------
|[{"b":"value1","x":23},{"b":value2","x":52}...]|valC|A1|
Now when i try to apply the same get_json_object on above data frame, i get all null values.
val nestedData = data.selectExpr("get_json_object(a, '$.b')","c","ID")
nestedData.show(false)
|get_json_object(a, '$.b')| c | ID|
------------------------------------
|null |valC|A1 |
I tried explode as well since col 'a' has array and struct. But that didn't work either as the data frame 'data' is returning col/field 'a' as a string instead of an array.Any ideas how to solve this?
Update: I also tried parsing using JSON4s and net.liftweb.json.parse . That didn't help either
case class aInfo(b: String)
case class col1(a: Option[aInfo]), c: String)
import net.liftweb.json.parse
val parseJson = udf((data: String) => {
implicit val formats = net.liftweb.json.DefaultFormats
parse(data).extract[Data]
})
val parsed = users.withColumn("parsedJSON", parseJson($"data"))
parsed.show(1)
All values came out as null when i used these parsers.
My expected result: I am trying to get a flattened out structure from the dataset
|b |x |c | ID|
--------------------
|value1|23|valC|A1 |
|value2|52|valC|A1 |