I have a PySpark DataFrame with an array of structs, each containing two fields (Dep and ABC). I want to add a new field to the struct, newcol.
This question answered "how to add a column to a nested struct", but I'm failing to transfer it to my case, where the struct is further nested inside an array. I can't seem to reference/recreate the array-struct schema.
My schema:
 |-- Id: string (nullable = true)
 |-- values: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Dep: long (nullable = true)
 |    |    |-- ABC: string (nullable = true)
What it should become:
 |-- Id: string (nullable = true)
 |-- values: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Dep: long (nullable = true)
 |    |    |-- ABC: string (nullable = true)
 |    |    |-- newcol: string (nullable = true)
How do I transfer the solution to my nested struct?
Reproducible code to get a df with this array-of-structs layout (note that here id is an integer and Dep is a string, unlike the schema above):
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    ArrayType, IntegerType, StringType, StructField, StructType
)

spark = SparkSession.builder.getOrCreate()

# Each row: an id plus an array of {Dep, ABC} structs
data = [
    (10, [{"Dep": 10, "ABC": 1}, {"Dep": 10, "ABC": 1}]),
    (20, [{"Dep": 20, "ABC": 1}, {"Dep": 20, "ABC": 1}]),
    (30, [{"Dep": 30, "ABC": 1}, {"Dep": 30, "ABC": 1}]),
    (40, [{"Dep": 40, "ABC": 1}, {"Dep": 40, "ABC": 1}]),
]
myschema = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField(
            "values",
            ArrayType(
                StructType([
                    StructField("Dep", StringType(), True),
                    StructField("ABC", StringType(), True),
                ])
            ),
        ),
    ]
)
df = spark.createDataFrame(data=data, schema=myschema)
df.printSchema()
df.show(10, False)