Create new Dataframe from a dataframe with Column contain JSON data

Question

this is the first time I ask on SO. I have a Dataframe like this

+----------+----------------------------------+
|        id|                              data|
+----------+----------------------------------+
|     '001'|     '[{"index":1}, {"index": 2}]'|
|     '002'|     '[{"index":3}, {"index": 4}]'|
+----------+----------------------------------+

I need to convert it to the new DF like that

+----------+---------+
|        id|    index|
+----------+---------+
|     '001'|        1|
|     '001'|        2|
|     '002'|        3|
|     '002'|        4|
+----------+---------+

Is there a way to do it? Thank you.

Maybe this can help you: https://stackoverflow.com/questions/34069282/how-to-query-json-data-column-using-spark-dataframes — Shaido, May 02 '19 at 07:58

howie · Accepted Answer · 2019-05-02T08:26:38.723

0

Try this ~

 df = self.spark.createDataFrame(
            [('001', '[{"index": 1}, {"index": 2}]'),
             ('002', '[{"index": 3}, {"index": 4}]'),
             ],
            ("id", "data"))

        schema = ArrayType(StructType([StructField("index", IntegerType())]))
        df = df.withColumn("json", from_json("data", schema))

        df.show(100)
        df = df.select(col("id"), explode("json").alias("index"))
        df.show(100)


+---+--------------------+----------+
| id|                data|      json|
+---+--------------------+----------+
|001|[{"index": 1}, {"...|[[1], [2]]|
|002|[{"index": 3}, {"...|[[3], [4]]|
+---+--------------------+----------+

+---+-----+
| id|index|
+---+-----+
|001|  [1]|
|001|  [2]|
|002|  [3]|
|002|  [4]|
+---+-----+

edited May 02 '19 at 08:26

answered May 02 '19 at 08:13

howie

2,587
3
27
43

Thank you, but it's not work with JSON String: cannot resolve 'explode(`data`)' due to data type mismatch: input to function explode should be array or map type, not string;; – Nghĩa Nguyễn May 02 '19 at 08:19
Hi @howie, my reality dataset is little weird: ' \'[{"index": 1}, {"index": 2}]\' ' As you see, the single quote wrap the string (with another single quote), so the function from_json return null. Do you have any suggestion for this? – Nghĩa Nguyễn May 02 '19 at 09:36

score 0 · Answer 2 · answered May 02 '19 at 13:44

This is another approach i worked out. It involves various statements, however, all of these statements can be combined in a single one to produce the desired output.

After creating the initial dataframe named 'df',

df.show(5,False)
+---+----------------------------+
|id |data                        |
+---+----------------------------+
|001|[{"index": 1}, {"index": 2}]|
|002|[{"index": 3}, {"index": 4}]|
+---+----------------------------+

df2 = df.select(col('id'),split(df.data,',').alias('list'))

This creates a dataframe named 'df2' that has second column split up into array type.

df2.show(5,False)
+---+-------------------------------+
|id |list                           |
+---+-------------------------------+
|001|[[{"index": 1},  {"index": 2}]]|
|002|[[{"index": 3},  {"index": 4}]]|
+---+-------------------------------+

then, running the explode function, df3 = df2.select(col('id'),explode(df2.list))

df3.show(5,False)
+---+--------------+
|id |col           |
+---+--------------+
|001|[{"index": 1} |
|001| {"index": 2}]|
|002|[{"index": 3} |
|002| {"index": 4}]|
+---+--------------+

followed by , df4 = df3.select(col('id'),regexp_extract('col','(\d+)',1).alias('no_only')) this transformation check for number in the exploded column.

df4.show(5,False)
+---+-------+
|id |no_only|
+---+-------+
|001|1      |
|001|2      |
|002|3      |
|002|4      |
+---+-------+

Create new Dataframe from a dataframe with Column contain JSON data

2 Answers2