
In my Spark (2.2) DataFrame, each row contains a JSON string:

df.head()
//output
//[{"key":"111","event_name":"page-visited","timestamp":1517814315}]

df.show()
//output
//+--------------+
//|         value|
//+--------------+
//|{"key":"111...|
//|{"key":"222...|

I want to parse each JSON row into separate columns in order to get this result:

key   event_name     timestamp
111   page-visited   1517814315
...

I tried this approach, but it does not give me the expected result:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("key", StringType, true),
  StructField("event_name", StringType, true),
  StructField("timestamp", IntegerType, true)
))

val result = df.withColumn("value", from_json($"value", schema))

and:

result.printSchema()
root
 |-- value: struct (nullable = true)
 |    |-- key: string (nullable = true)
 |    |-- event_name: string (nullable = true)
 |    |-- timestamp: integer (nullable = true)

while it should be:

result.printSchema()
root
 |-- key: string (nullable = true)
 |-- event_name: string (nullable = true)
 |-- timestamp: integer (nullable = true)
Markus

1 Answer

You can use select($"value.*") at the end to expand the fields of the struct column into separate top-level columns:

val result = df.withColumn("value", from_json($"value", schema)).select($"value.*")
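Putting it together, here is a minimal, self-contained sketch of the whole flow. The sample rows and the object/method names (`FromJsonExample`, `flatten`) are illustrative, not from the original post; the `from_json` + `select($"value.*")` pattern is exactly what the answer describes.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

object FromJsonExample {
  // Parses the JSON "value" column and flattens the resulting struct
  // into top-level columns (key, event_name, timestamp).
  def flatten(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Sample data mirroring the rows in the question
    val df = Seq(
      """{"key":"111","event_name":"page-visited","timestamp":1517814315}""",
      """{"key":"222","event_name":"page-visited","timestamp":1517814330}"""
    ).toDF("value")

    val schema = StructType(Seq(
      StructField("key", StringType, true),
      StructField("event_name", StringType, true),
      StructField("timestamp", IntegerType, true)
    ))

    // from_json turns the string into a struct; value.* expands it
    df.withColumn("value", from_json($"value", schema))
      .select($"value.*")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("from_json-example")
      .master("local[*]")
      .getOrCreate()

    val result = flatten(spark)
    result.printSchema()
    result.show()
    spark.stop()
  }
}
```

Note that rows whose JSON fails to match the schema come back as nulls rather than errors, which is `from_json`'s default behavior.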
Ramesh Maharjan