
I have a data frame with a JSON string column. Example below: there are 3 columns, a, b, and c, and column c is StringType.

| a  | b   | c                        |
|----|-----|--------------------------|
| 77 | ABC | {"12549":38,"333513":39} |
| 78 | ABC | {"12540":38,"333513":39} |

I want to pivot the JSON keys into columns of the data frame, as in the example below:

| a  | b   | 12549 | 333513 | 12540 |
|----|-----|-------|--------|-------|
| 77 | ABC | 38    | 39     | null  |
| 78 | ABC | null  | 39     | 38    |
lucy

1 Answer


This may not be the most efficient approach, since it has to read all of the JSON records an extra time to infer the schema. If you can define the schema statically, it will perform better.

import org.apache.spark.sql.functions.from_json
import spark.implicits._

val data = spark.createDataset(Seq(
  (77, "ABC", "{\"12549\":38,\"333513\":39}"),
  (78, "ABC", "{\"12540\":38,\"333513\":39}")
)).toDF("a", "b", "c")

// Read the json column once, only to infer its schema
val schema = spark.read.json(data.select("c").as[String]).schema

// Parse the json column with the inferred schema, then flatten the struct
data.select($"a", $"b", from_json($"c", schema).as("s")).select("a", "b", "s.*").show(false)
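As mentioned, a statically defined schema avoids the extra inference pass. A minimal sketch, assuming the JSON keys (`12549`, `12540`, `333513`) are known in advance and the values fit in a long:

    import org.apache.spark.sql.types.{StructType, StructField, LongType}
    import org.apache.spark.sql.functions.from_json
    import spark.implicits._

    // Hand-built schema matching the known json keys; skips the inference read
    val staticSchema = StructType(Seq(
      StructField("12549", LongType),
      StructField("12540", LongType),
      StructField("333513", LongType)
    ))

    data.select($"a", $"b", from_json($"c", staticSchema).as("s")).select("a", "b", "s.*").show(false)

Keys missing from a given record simply come back as null in the corresponding column.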

Result:

+---+---+-----+-----+------+
|a  |b  |12540|12549|333513|
+---+---+-----+-----+------+
|77 |ABC|null |38   |39    |
|78 |ABC|38   |null |39    |
+---+---+-----+-----+------+
Travis Hegner
  • great. But I am getting _corrupt_record column also – lucy Mar 25 '19 at 15:39
  • Some of your json data is either corrupt or contains newlines. You can try with a multiline option `val schema = spark.read.option("multiline", true).json(data.select("c").as[String]).schema` or you'll have to filter or correct the corrupt data: [source](https://stackoverflow.com/questions/35409539/corrupt-record-error-when-reading-a-json-file-into-spark) – Travis Hegner Mar 25 '19 at 15:51
  • How to make the header be something like "a, b, json_12540, json_12549, json_333513"? – Hang Wu Aug 24 '21 at 02:26
  • To do this dynamically, I think you'd have to iterate `df.columns` and selectively rename them individually with `.withColumnRenamed()`. I will reiterate, however, that a static schema would be much more performant in this case. – Travis Hegner Aug 24 '21 at 13:03
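The renaming suggested in that last comment could be sketched like this (a hypothetical helper, assuming `schema` was inferred as above and that every flattened column other than `a` and `b` came from the json):

    // Prefix every json-derived column with "json_"; a and b are left untouched
    val flattened = data
      .select($"a", $"b", from_json($"c", schema).as("s"))
      .select("a", "b", "s.*")

    val renamed = flattened.columns.foldLeft(flattened) { (df, col) =>
      if (col == "a" || col == "b") df else df.withColumnRenamed(col, s"json_$col")
    }

With the sample data this would yield the header `a, b, json_12540, json_12549, json_333513`, though as the answerer notes, building the prefixed names directly into a static schema would be cheaper.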