
I have a data frame with a JSON string column. Example below: there are 3 columns, a, b, and c, and column c is StringType.

| a  | b   | c                        |
|----|-----|--------------------------|
| 77 | ABC | {"12549":38,"333513":39} |
| 78 | ABC | {"12540":38,"333513":39} |

I want to pivot the JSON keys into columns of the data frame, as in the example below:

| a  | b   | 12549 | 333513 | 12540 |
|----|-----|-------|--------|-------|
| 77 | ABC | 38    | 39     | null  |
| 78 | ABC | null  | 39     | 38    |
lucy

1 Answer


This may not be the most efficient approach, since it has to read all of the JSON records an extra time to infer the schema. If you can define the schema statically, it will perform better.

import org.apache.spark.sql.functions.from_json
import spark.implicits._

val data = spark.createDataset(Seq(
  (77, "ABC", "{\"12549\":38,\"333513\":39}"),
  (78, "ABC", "{\"12540\":38,\"333513\":39}")
)).toDF("a", "b", "c")

// Read the json column once, only to infer its schema
val schema = spark.read.json(data.select("c").as[String]).schema

// Parse the json column with the inferred schema, then flatten the struct
data.select($"a", $"b", from_json($"c", schema).as("s")).select("a", "b", "s.*").show(false)
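As mentioned, a statically defined schema avoids the extra inference pass. A minimal sketch, assuming the JSON keys (`12549`, `12540`, `333513`) are known in advance and the values fit in a long:

    import org.apache.spark.sql.types.{StructType, StructField, LongType}
    import org.apache.spark.sql.functions.from_json
    import spark.implicits._

    // Hand-built schema matching the known json keys; skips the inference read
    val staticSchema = StructType(Seq(
      StructField("12549", LongType),
      StructField("12540", LongType),
      StructField("333513", LongType)
    ))

    data.select($"a", $"b", from_json($"c", staticSchema).as("s")).select("a", "b", "s.*").show(false)

Keys missing from a given record simply come back as null in the corresponding column.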

Result:

+---+---+-----+-----+------+
|a  |b  |12540|12549|333513|
+---+---+-----+-----+------+
|77 |ABC|null |38   |39    |
|78 |ABC|38   |null |39    |
+---+---+-----+-----+------+
Travis Hegner
  • great. But I am getting _corrupt_record column also – lucy Mar 25 '19 at 15:39
  • Some of your json data is either corrupt or contains newlines. You can try with a multiline option `val schema = spark.read.option("multiline", true).json(data.select("c").as[String]).schema` or you'll have to filter or correct the corrupt data: [source](https://stackoverflow.com/questions/35409539/corrupt-record-error-when-reading-a-json-file-into-spark) – Travis Hegner Mar 25 '19 at 15:51
  • How to make the header be something like "a, b, json_12540, json_12549, json_333513"? – Hang Wu Aug 24 '21 at 02:26
  • To do this dynamically, I think you'd have to iterate `df.columns` and selectively rename them individually with `.withColumnRenamed()`. I will reiterate, however, that a static schema would be much more performant in this case. – Travis Hegner Aug 24 '21 at 13:03
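The renaming suggested in that last comment could be sketched like this (a hypothetical helper, assuming `schema` was inferred as above and that every flattened column other than `a` and `b` came from the json):

    // Prefix every json-derived column with "json_"; a and b are left untouched
    val flattened = data
      .select($"a", $"b", from_json($"c", schema).as("s"))
      .select("a", "b", "s.*")

    val renamed = flattened.columns.foldLeft(flattened) { (df, col) =>
      if (col == "a" || col == "b") df else df.withColumnRenamed(col, s"json_$col")
    }

With the sample data this would yield the header `a, b, json_12540, json_12549, json_333513`, though as the answerer notes, building the prefixed names directly into a static schema would be cheaper.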