
I'm struggling (as a Spark/Scala newbie) to achieve the following conversion, which splits the JSON into columns.

e.g. from a C* (Cassandra) table:

+----+----------------------------+
| id | jsonData                   |
+----+----------------------------+
| 1  | {"a": "123", "b": "xyz" }  |
| 2  | {"a": "3", "b": "bar" }    |
+----+----------------------------+

to a Spark DataFrame:

+----+-----+-----+
| id | a   | b   |
+----+-----+-----+
| 1  | 123 | xyz |
| 2  | 3   | bar |
+----+-----+-----+

I'm using Spark 1.6 and Scala 2.10.

Update: I don't know the key names (or how many there are) in the JSON.
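
A minimal sketch of the schema-inference route discussed in the comments below, for Spark 1.6 / Scala 2.10. The Cassandra `table`/`keyspace` options and the numeric `id` are hypothetical placeholders; any DataFrame with (id, jsonData) columns works. Instead of joining back on a unique field, it wraps `id` into each JSON string so a single `read.json` pass infers the schema and keeps the other column:

```scala
// Sketch only, Spark 1.6 / Scala 2.10. Cassandra options below are
// hypothetical placeholders; any DataFrame with (id, jsonData) works.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.StructType

val sqlContext: SQLContext = ??? // use your existing SQLContext
import sqlContext.implicits._

val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "my_table", "keyspace" -> "my_keyspace")) // placeholders
  .load()

// Embed id in each JSON string so one read.json pass both infers the
// schema and keeps the id column (assumes a numeric id).
val jsonWithId = df.select($"id", $"jsonData").map { row =>
  s"""{"id": ${row.get(0)}, "data": ${row.getString(1)}}"""
}

// Schema inference happens here: an extra full pass over the data.
val parsed = sqlContext.read.json(jsonWithId)

// Flatten the inferred struct into one top-level column per JSON key.
val fields = parsed.schema("data").dataType.asInstanceOf[StructType].fieldNames
parsed.select($"id" +: fields.map(f => parsed(s"data.$f").as(f)): _*).show()
```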

jfgosselin
  • @zero323 The solution from stackoverflow.com/questions/34069282/… assumes that I know the JSON keys, which is not the case in my scenario. Also, the top answer in stackoverflow.com/questions/30033875/… is closest, but the final DataFrame doesn't keep the other columns. – jfgosselin Feb 10 '17 at 20:38
  • You can use schema inference with both (using the same RDD mechanism to infer the schema), and you can keep the other columns with both (as long as you have a unique field to identify them). It just costs much more than going at it directly, and if the schema is too complex to specify, it typically won't work well with Spark. – zero323 Feb 10 '17 at 20:51
  • Sorry, you lost me (still a newbie); could you show an example? Thanks – jfgosselin Feb 11 '17 at 06:10
  • The example is in the linked question, just check the `Spark <= 1.5` part :) For the schema: `val schema = spark.read.json(df.select($"jsonData").as[String].rdd).schema; df.withColumn("jsonData", from_json($"jsonData", schema))`, but it will be very expensive (expanded in the sketch below). – zero323 Feb 11 '17 at 12:32
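
Expanded into a runnable form, the snippet from the last comment looks roughly like this (a sketch, not a definitive implementation; `from_json` requires Spark 2.1+, i.e. newer than the Spark 1.6 mentioned in the question). `df` is assumed to be the same (id, jsonData) DataFrame and `spark` an existing SparkSession:

```scala
// Sketch only: requires Spark 2.1+ for from_json.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json

val spark: SparkSession = ??? // use your existing SparkSession
import spark.implicits._

// The expensive part: a full extra pass over jsonData just to infer the schema.
val schema = spark.read.json(df.select($"jsonData").as[String].rdd).schema

// Parse in place, then expand the resulting struct into top-level columns.
val parsed = df.withColumn("jsonData", from_json($"jsonData", schema))
parsed.select($"id", $"jsonData.*").show()
```

Both variants pay for a second pass over the JSON strings purely to infer the schema, which is the cost zero323 is pointing at.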
