How to flatten nested arrays with different shapes in PySpark? Here is answered How to flatten nested arrays by merging values in spark with same shape arrays . I'm getting errors described below for arrays with different shapes.
Data-structure:
- Static names:
id
,date
,val
,num
(can be hardcoded) - Dynamic names:
name_1_a
,name_10000_xvz
(cannot be hardcoded as the data frame has up to 10000 columns/arrays)
Input df:
root
|-- id: long (nullable = true)
|-- name_10000_xvz: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true) **NOTE: additional `num` field **
| | |-- val: long (nullable = true)
|-- name_1_a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
df.show(truncate=False)
+---+---------------------------------------------------------------------+---------------------------------+
|id |name_10000_xvz |name_1_a |
+---+---------------------------------------------------------------------+---------------------------------+
|1 |[{2000, null, 30}, {2001, null, 31}, {2002, null, 32}, {2003, 1, 33}]|[{2001, 1}, {2002, 2}, {2003, 3}]|
+---+---------------------------------------------------------------------+---------------------------------+
Required output df:
+---+--------------+----+---+---+
| id| name|date|val|num|
+---+--------------+----+---+---+
| 1| name_1_a|2001| 1| |
| 1| name_1_a|2002| 2| |
| 1| name_1_a|2003| 3| |
| 1|name_10000_xvz|2000| 30| |
| 1|name_10000_xvz|2001| 31| |
| 1|name_10000_xvz|2002| 32| |
| 1|name_10000_xvz|2003| 33| 1 |
+---+--------------+----+---+---+
Code to reproduce:
NOTE: when i add el.num
in TRANSFORM({name}, el -> STRUCT("{name}" AS name, el.date, el.val, el.num
I get the error below.
import pyspark.sql.functions as f
df = spark.read.json(
sc.parallelize(
[
"""{"id":1,"name_1_a":[{"date":2001,"val":1},{"date":2002,"val":2},{"date":2003,"val":3}],"name_10000_xvz":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33, "num":1}]}"""
]
)
).select("id", "name_1_a", "name_10000_xvz")
names = [column for column in df.columns if column.startswith("name_")]
expressions = []
for name in names:
expressions.append(
f.expr(
'TRANSFORM({name}, el -> STRUCT("{name}" AS name, el.date, el.val, el.num))'.format(
name=name
)
)
)
flatten_df = df.withColumn("flatten", f.flatten(f.array(*expressions))).selectExpr(
"id", "inline(flatten)"
)
Output:
AnalysisException: No such struct field num in date, Val