
I have a PySpark DataFrame with an array of structs, where each struct has two fields (Dep and ABC). I want to add a new field, newcol, to the struct.

This question answered "how to add a column to a nested struct", but I'm failing to transfer it to my case, where the struct is further nested inside an array. I can't seem to reference/recreate the array-struct schema.
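For reference, the approach from that question for a flat (non-array) struct looks roughly like this (a minimal sketch, assuming a struct column named s; the literal value is a placeholder):

from pyspark.sql import functions as F

# Rebuild the struct from all of its existing fields plus the new one
df = df.withColumn("s", F.struct(F.col("s.*"), F.lit("x").alias("newcol")))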

My schema:

 |-- Id: string (nullable = true)
 |-- values: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Dep: long (nullable = true)
 |    |    |-- ABC: string (nullable = true)

What it should become:

 |-- Id: string (nullable = true)
 |-- values: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Dep: long (nullable = true)
 |    |    |-- ABC: string (nullable = true)
 |    |    |-- newcol: string (nullable = true)

How do I transfer the solution to my nested struct?

Reproducible code to get a df of the above schema:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

spark = SparkSession.builder.getOrCreate()

# Dep/ABC are given as strings to match the StringType fields declared below
data = [
    (10, [{"Dep": "10", "ABC": "1"}, {"Dep": "10", "ABC": "1"}]),
    (20, [{"Dep": "20", "ABC": "1"}, {"Dep": "20", "ABC": "1"}]),
    (30, [{"Dep": "30", "ABC": "1"}, {"Dep": "30", "ABC": "1"}]),
    (40, [{"Dep": "40", "ABC": "1"}, {"Dep": "40", "ABC": "1"}]),
]
myschema = StructType([
    StructField("id", IntegerType(), True),
    StructField("values", ArrayType(
        StructType([
            StructField("Dep", StringType(), True),
            StructField("ABC", StringType(), True)
        ])
    ))
])
df = spark.createDataFrame(data=data, schema=myschema)
df.printSchema()
df.show(10, False)
Cribber

2 Answers


For Spark version >= 3.1, you can use the transform function together with the Column.withField method to achieve this.

transform applies the given function to each element of the array (here, each struct(Dep, ABC) in the values column). withField adds or replaces a field in a StructType column by name.

from pyspark.sql import functions as F

df = df.withColumn('values', F.transform('values', lambda x: x.withField('newcol', F.lit(1))))
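As a quick check with the reproducible df above, the schema should come out roughly as follows. Note that F.lit(1) produces an integer field; use F.lit("1") if newcol should be a string, as in the target schema:

df.printSchema()
# root
#  |-- id: integer (nullable = true)
#  |-- values: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- Dep: string (nullable = true)
#  |    |    |-- ABC: string (nullable = true)
#  |    |    |-- newcol: integer (nullable = false)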
过过招
  • try: ```df = df.withColumn('values', F.transform('values', lambda x: x.withField('Dep', x['Dep'].cast('int'))))``` – 过过招 Mar 31 '22 at 07:47
  • This looks neater `df.withColumn("values",F.expr("transform(values, x -> struct(cast((x.Dep) as integer) as Dep, x.ABC))"))` – wwnde Mar 31 '22 at 09:59
  • It depends on personal habits and familiarity. At the beginning, I was used to answering questions using Spark SQL expressions, but I found that many people are more used to the DataFrame API. – 过过招 Mar 31 '22 at 10:05

Another way of doing it would be to use SQL expressions (the transform higher-order function has been available in Spark SQL since 2.4, so this also works before Spark 3.1):

df = df.withColumn("values",F.expr("transform(values, x -> struct(COALESCE('1') as newcol,x.Dep,x.ABC))"))
wwnde