I would like to perform an operation similar to pandas.io.json.json_normalize on a PySpark DataFrame. Is there an equivalent function in Spark?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html
Spark has a similar function, explode(), but it is not entirely identical.
Here is how explode() works at a very high level.
>>> from pyspark.sql.functions import explode, col
>>> data = {'A': [1, 2]}
>>> df = spark.createDataFrame([data])  # wrap the dict in a list so it is read as one row
>>> df.show()
+------+
| A|
+------+
|[1, 2]|
+------+
>>> df.select(explode(col('A')).alias('normalized')).show()
+----------+
|normalized|
+----------+
| 1|
| 2|
+----------+
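json_normalize typically flattens an array of nested records into one row per record. The closest Spark pattern combines explode() with struct field selection. Here is a minimal sketch; the id/orders/sku/qty names are made up for illustration:
from pyspark.sql import Row
from pyspark.sql.functions import explode, col

# One row holding an array of nested records, similar to the input
# json_normalize would receive.
df = spark.createDataFrame([
    Row(id=1, orders=[Row(sku="a", qty=2), Row(sku="b", qty=1)])
])

# explode() turns each array element into its own row; dot notation
# then promotes the struct fields to top-level columns.
df.select("id", explode("orders").alias("order")) \
  .select("id", col("order.sku"), col("order.qty")) \
  .show()
This prints one row per order, with sku and qty as flat columns next to id.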
Alternatively, you could convert the Spark DataFrame to a pandas DataFrame using spark_df.toPandas(), apply json_normalize() there, and then convert back to a Spark DataFrame with spark.createDataFrame(pandas_df).
Please note that this back-and-forth solution is not ideal: calling toPandas() collects all records of the DataFrame to the driver (like .collect()), which can cause out-of-memory errors when working with larger datasets.
The link below provides more insight on using toPandas(): DF.topandas() throwing error in pyspark
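Here is a minimal sketch of that round trip, assuming the data fits on the driver (spark_df stands in for your existing DataFrame; note that json_normalize moved to the top-level pandas namespace in pandas 1.0):
import pandas as pd

# Collect the Spark DataFrame to the driver as a pandas DataFrame.
pandas_df = spark_df.toPandas()

# Flatten the nested records with pandas.
records = pandas_df.to_dict(orient="records")
flat_df = pd.json_normalize(records)

# Convert the flattened result back into a Spark DataFrame.
spark_flat_df = spark.createDataFrame(flat_df)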
Hope this helps and good luck!
There is no direct counterpart of json_normalize in PySpark, but Spark offers different options. If you have nested objects in a DataFrame like this
one
|_a
|_..
two
|_b
|_..
you can select the child columns in Spark as follows:
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.appName("stackoverflow demo").getOrCreate()

# The nested dicts are inferred as map columns, so "one" and "two"
# can be drilled into with dot notation.
columns = ['id', 'one', 'two']
vals = [
    (1, {"a": False}, {"b": True}),
    (2, {"a": True}, {"b": False})
]
df = spark.createDataFrame(vals, columns)

# Dot notation pulls each nested value up as a top-level column.
df.select("one.a", "two.b").show()
+-----+-----+
| a| b|
+-----+-----+
|false| true|
| true|false|
+-----+-----+
If you build a flattened list of all nested columns using a recursive "flatten" function from this answer, you get a flat column structure:
columns = flatten(df.schema)
df.select(columns)
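The linked answer has the full implementation; the sketch below is an assumption of what such a recursive flatten can look like for struct columns, not the linked code verbatim:
from pyspark.sql.types import StructType

def flatten(schema, prefix=None):
    # Recursively collect fully qualified column names from a schema.
    columns = []
    for field in schema.fields:
        name = f"{prefix}.{field.name}" if prefix else field.name
        if isinstance(field.dataType, StructType):
            # Recurse into nested structs instead of selecting them whole.
            columns += flatten(field.dataType, prefix=name)
        else:
            columns.append(name)
    return columns
Note that this version only descends into struct columns; map columns (like the dict-inferred ones above) keep their keys accessible via dot notation but are not expanded automatically.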