That's going to be tricky since the top-level elements of the JSONs are not the same, i.e. you have map1
and map2
, and hence the schema is inconsistent. I'd speak to the "data producer" and requests a change so the name of the command is described by a separate element.
Given the schema of the DataFrame is as follows:
scala> commands.printSchema
root
|-- commands: array (nullable = true)
| |-- element: string (containsNull = true)
and the number of elements (rows) in it:
scala> commands.count
res1: Long = 1
You have to explode the commands
array of elements first followed by accessing the fields of interest.
// 1. Explode the array
val commandsExploded = commands.select(explode($"commands") as "command")
scala> commandsExploded.count
res2: Long = 2
Let's create the schema of the JSON-encoded records. One could be as follows.
// Note that it accepts map1 and map2 fields
import org.apache.spark.sql.types._
val schema = StructType(
StructField("map1",
StructType(
StructField("completed-stages", LongType, true) ::
StructField("total-stages", LongType, true) :: Nil), true) ::
StructField("map2",
StructType(
StructField("completed-stages", LongType, true) ::
StructField("total-stages", LongType, true) :: Nil), true) ::
StructField("rec", StringType,true) ::
StructField("status", StructType(
StructField("state", StringType, true) :: Nil), true
) :: Nil)
With that, you should use from_json standard function that takes a column with JSON-encoded strings and a schema.
val commands = commandsExploded.select(from_json($"command", schema) as "command")
scala> commands.show(truncate = false)
+-------------------------------+
|command |
+-------------------------------+
|[[1, 1],, test-plan, [SUCCESS]]|
|[, [1, 1], test-proc, [FAILED]]|
+-------------------------------+
Let's have a look at the schema of the commands
dataset.
scala> commands.printSchema
root
|-- command: struct (nullable = true)
| |-- map1: struct (nullable = true)
| | |-- completed-stages: long (nullable = true)
| | |-- total-stages: long (nullable = true)
| |-- map2: struct (nullable = true)
| | |-- completed-stages: long (nullable = true)
| | |-- total-stages: long (nullable = true)
| |-- rec: string (nullable = true)
| |-- status: struct (nullable = true)
| | |-- state: string (nullable = true)
The complex fields like rec
and status
are structs that are .
-accessible.
val recs = commands.select(
$"command.rec" as "rec",
$"command.status.state" as "status")
scala> recs.show
+---------+-------+
| rec| status|
+---------+-------+
|test-plan|SUCCESS|
|test-proc| FAILED|
+---------+-------+
Converting it to a single-record JSON-encoded dataset requires Dataset.toJSON followed by collect_list standard function.
val result = recs.toJSON.agg(collect_list("value"))
scala> result.show(truncate = false)
+-------------------------------------------------------------------------------+
|collect_list(value) |
+-------------------------------------------------------------------------------+
|[{"rec":"test-plan","status":"SUCCESS"}, {"rec":"test-proc","status":"FAILED"}]|
+-------------------------------------------------------------------------------+