I would like to perform an operation similar to pandas.io.json.json_normalize on a PySpark DataFrame. Is there an equivalent function in Spark?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html
Spark has a similar function, explode(), but it is not entirely identical.
Here is how explode() works at a very high level.
>>> from pyspark.sql.functions import explode, col
>>> data = {'A': [1, 2]}
>>> df = spark.createDataFrame([data])  # wrap the dict in a list so it is read as one row
>>> df.show()
+------+
| A|
+------+
|[1, 2]|
+------+
>>> df.select(explode(col('A')).alias('normalized')).show()
+----------+
|normalized|
+----------+
| 1|
| 2|
+----------+
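json_normalize typically flattens an array of nested records into one row per record. The closest Spark pattern combines explode() with struct field selection. Here is a minimal sketch; the id/orders/sku/qty names are made up for illustration:
from pyspark.sql import Row
from pyspark.sql.functions import explode, col

# One row holding an array of nested records, similar to the input
# json_normalize would receive.
df = spark.createDataFrame([
    Row(id=1, orders=[Row(sku="a", qty=2), Row(sku="b", qty=1)])
])

# explode() turns each array element into its own row; dot notation
# then promotes the struct fields to top-level columns.
df.select("id", explode("orders").alias("order")) \
  .select("id", col("order.sku"), col("order.qty")) \
  .show()
This prints one row per order, with sku and qty as flat columns next to id.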
Alternatively, you could convert the Spark DataFrame to a pandas DataFrame using spark_df.toPandas(), apply json_normalize() there, and then convert back to a Spark DataFrame with spark.createDataFrame(pandas_df).
Please note that this back-and-forth solution is not ideal: calling toPandas() collects all records of the DataFrame to the driver (like .collect()), which can cause out-of-memory errors when working with larger datasets.
The link below provides more insight on using toPandas(): DF.topandas() throwing error in pyspark
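Here is a minimal sketch of that round trip, assuming the data fits on the driver (spark_df stands in for your existing DataFrame; note that json_normalize moved to the top-level pandas namespace in pandas 1.0):
import pandas as pd

# Collect the Spark DataFrame to the driver as a pandas DataFrame.
pandas_df = spark_df.toPandas()

# Flatten the nested records with pandas.
records = pandas_df.to_dict(orient="records")
flat_df = pd.json_normalize(records)

# Convert the flattened result back into a Spark DataFrame.
spark_flat_df = spark.createDataFrame(flat_df)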
Hope this helps and good luck!
There is no direct counterpart of json_normalize in PySpark, but Spark offers different options. If you have nested objects in a DataFrame like this
one
|_a
|_..
two
|_b
|_..
you can select the child columns in Spark as follows:
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.appName("stackoverflow demo").getOrCreate()

# The nested dicts are inferred as map columns, so "one" and "two"
# can be drilled into with dot notation.
columns = ['id', 'one', 'two']
vals = [
    (1, {"a": False}, {"b": True}),
    (2, {"a": True}, {"b": False})
]
df = spark.createDataFrame(vals, columns)

# Dot notation pulls each nested value up as a top-level column.
df.select("one.a", "two.b").show()
+-----+-----+
| a| b|
+-----+-----+
|false| true|
| true|false|
+-----+-----+
If you build a flattened list of all nested columns using a recursive "flatten" function from this answer, you get a flat column structure:
columns = flatten(df.schema)
df.select(columns)
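The linked answer has the full implementation; the sketch below is an assumption of what such a recursive flatten can look like for struct columns, not the linked code verbatim:
from pyspark.sql.types import StructType

def flatten(schema, prefix=None):
    # Recursively collect fully qualified column names from a schema.
    columns = []
    for field in schema.fields:
        name = f"{prefix}.{field.name}" if prefix else field.name
        if isinstance(field.dataType, StructType):
            # Recurse into nested structs instead of selecting them whole.
            columns += flatten(field.dataType, prefix=name)
        else:
            columns.append(name)
    return columns
Note that this version only descends into struct columns; map columns (like the dict-inferred ones above) keep their keys accessible via dot notation but are not expanded automatically.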