
I'm struggling (as a Spark/Scala newbie) to achieve the following conversion, which splits the JSON into columns.

e.g. from a C* (Cassandra) table:

+----+----------------------------+
| id | jsonData                   |
+----+----------------------------+
| 1  | {"a": "123", "b": "xyz" }  |
| 2  | {"a": "3", "b": "bar" }    |
+----+----------------------------+

to a Spark DataFrame:

+----+-----+-----+
| id | a   | b   |
+----+-----+-----+
| 1  | 123 | xyz |
| 2  | 3   | bar |
+----+-----+-----+

I'm using Spark 1.6 and Scala 2.10.

Update: I don't know the key names (or how many there are) in the JSON.
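
A minimal sketch of the schema-inference route discussed in the comments below, for Spark 1.6 / Scala 2.10. The Cassandra `table`/`keyspace` options and the numeric `id` are hypothetical placeholders; any DataFrame with (id, jsonData) columns works. Instead of joining back on a unique field, it wraps `id` into each JSON string so a single `read.json` pass infers the schema and keeps the other column:

```scala
// Sketch only, Spark 1.6 / Scala 2.10. Cassandra options below are
// hypothetical placeholders; any DataFrame with (id, jsonData) works.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.StructType

val sqlContext: SQLContext = ??? // use your existing SQLContext
import sqlContext.implicits._

val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "my_table", "keyspace" -> "my_keyspace")) // placeholders
  .load()

// Embed id in each JSON string so one read.json pass both infers the
// schema and keeps the id column (assumes a numeric id).
val jsonWithId = df.select($"id", $"jsonData").map { row =>
  s"""{"id": ${row.get(0)}, "data": ${row.getString(1)}}"""
}

// Schema inference happens here: an extra full pass over the data.
val parsed = sqlContext.read.json(jsonWithId)

// Flatten the inferred struct into one top-level column per JSON key.
val fields = parsed.schema("data").dataType.asInstanceOf[StructType].fieldNames
parsed.select($"id" +: fields.map(f => parsed(s"data.$f").as(f)): _*).show()
```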

jfgosselin
  • @zero323 The solution from stackoverflow.com/questions/34069282/… assumes that I know the JSON keys, which is not the case in my scenario. Also, the top answer in stackoverflow.com/questions/30033875/… is closest, but the final DataFrame doesn't keep the other columns. – jfgosselin Feb 10 '17 at 20:38
  • You can use schema inference with both (using the same RDD mechanism to infer the schema), and you can keep the other columns with both (as long as you have a unique field to identify them). It just costs much more than going at it directly, and if the schema is too complex to specify, it typically won't work well with Spark. – zero323 Feb 10 '17 at 20:51
  • Sorry, you lost me (still a newbie); could you show an example? Thanks – jfgosselin Feb 11 '17 at 06:10
  • The example is in the linked question, just check the `Spark <= 1.5` part :) For the schema: `val schema = spark.read.json(df.select($"jsonData").as[String].rdd).schema; df.withColumn("jsonData", from_json($"jsonData", schema))`, but it will be very expensive (expanded in the sketch below). – zero323 Feb 11 '17 at 12:32
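
Expanded into a runnable form, the snippet from the last comment looks roughly like this (a sketch, not a definitive implementation; `from_json` requires Spark 2.1+, i.e. newer than the Spark 1.6 mentioned in the question). `df` is assumed to be the same (id, jsonData) DataFrame and `spark` an existing SparkSession:

```scala
// Sketch only: requires Spark 2.1+ for from_json.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json

val spark: SparkSession = ??? // use your existing SparkSession
import spark.implicits._

// The expensive part: a full extra pass over jsonData just to infer the schema.
val schema = spark.read.json(df.select($"jsonData").as[String].rdd).schema

// Parse in place, then expand the resulting struct into top-level columns.
val parsed = df.withColumn("jsonData", from_json($"jsonData", schema))
parsed.select($"id", $"jsonData.*").show()
```

Both variants pay for a second pass over the JSON strings purely to infer the schema, which is the cost zero323 is pointing at.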
