I am processing streaming events of different types and schemas in Spark with Scala. I need to parse them and save them in a format that is easy to process further in a generic way.
I have a dataframe of events that looks like this:
val df = Seq(("{\"a\": 1, \"b\": 2, \"c\": 3 }", "One", "001"), ("{\"a\": 6, \"d\": 2, \"f\": 8, \"g\": 10 }", "Two", "089"), ("{\"a\": 3, \"b\": 4, \"c\": 6 }", "Three", "123")).toDF("body", "type", "id")
which is this:
+------------------------------------+--------+------+
| body | type | id |
+------------------------------------+--------+------+
|{"a": 1, "b": 2, "c": 3 } | "One"| 001|
|{"a": 6, "d": 2, "f": 8, "g": 10} | "Two"| 089|
|{"a": 3, "b": 4, "c": 6 } | "Three"| 123|
+------------------------------------+--------+------+
and I would like to turn it into the one below. We can assume that all events of type "One" have the same schema, and that all event types share some common fields, such as the entry "a", which I would like to surface into its own column:
+---+--------------------------------+--------+------+
| a | data                           | type   | id   |
+---+--------------------------------+--------+------+
| 1 |{"b": 2, "c": 3 }               |   "One"|   001|
| 6 |{"d": 2, "f": 8, "g": 10}       |   "Two"|   089|
| 3 |{"b": 4, "c": 6 }               | "Three"|   123|
+---+--------------------------------+--------+------+
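The direction I have been considering (not sure it is idiomatic) is to pull out `a` with `get_json_object` and then strip it from the body with a small UDF; the `dropA` helper below is my own sketch, using json4s, not something from the Spark API:

```scala
import org.apache.spark.sql.functions._
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Surface the shared entry "a" as its own typed column
val withA = df.withColumn("a", get_json_object($"body", "$.a").cast("int"))

// Hypothetical UDF that removes the "a" key from the JSON body,
// leaving the type-specific remainder as a generic "data" payload
val dropA = udf { (json: String) =>
  compact(render(parse(json).removeField { case (k, _) => k == "a" }))
}

val result = withA
  .withColumn("data", dropA($"body"))
  .select($"a", $"data", $"type", $"id")
```

This keeps `data` as a JSON string, so every event type fits the same four-column layout, but I am unsure whether a UDF here is the right approach for a streaming workload, or whether `from_json` with per-type schemas would be preferable.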