How do I add a column to a nested struct in a PySpark dataframe?

Question

I have a dataframe with a schema like

root
 |-- state: struct (nullable = true)
 |    |-- fld: integer (nullable = true)

I'd like to add columns within the state struct, that is, create a dataframe with a schema like

root
 |-- state: struct (nullable = true)
 |    |-- fld: integer (nullable = true)
 |    |-- a: integer (nullable = true)

I tried

df.withColumn('state.a', val).printSchema()
# root
#  |-- state: struct (nullable = true)
#  |    |-- fld: integer (nullable = true)
#  |-- state.a: integer (nullable = true)

You can create a new column using a udf with the schema you desire and drop the old one. As far as I know, you can't change the schema of struct column. [see this question](https://stackoverflow.com/questions/45824403/pyspark-change-nested-column-datatype/45841615#45841615) — pauli, Feb 14 '18 at 03:07

pault · Accepted Answer · 2018-02-15T15:13:27.957

Here is a way to do it without using a udf:

# create example dataframe
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType, StructField, StructType
data = [
    ({'fld': 0},)
]

schema = StructType(
    [
        StructField('state',
            StructType(
                [StructField('fld', IntegerType())]
            )
        )
    ]
)

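# sqlCtx is the SQLContext predefined in older pyspark shells;
# with a SparkSession, spark.createDataFrame(data, schema) does the same
df = sqlCtx.createDataFrame(data, schema)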
df.printSchema()
#root
# |-- state: struct (nullable = true)
# |    |-- fld: integer (nullable = true)

Now use withColumn() and add the new field using lit() and alias().

val = 1
df_new = df.withColumn(
    'state', 
    f.struct(*[f.col('state')['fld'].alias('fld'), f.lit(val).alias('a')])
)
df_new.printSchema()
#root
# |-- state: struct (nullable = false)
# |    |-- fld: integer (nullable = true)
# |    |-- a: integer (nullable = false)

If you have a lot of fields in the nested struct you can use a list comprehension, using df.schema["state"].dataType.names to get the field names. For example:

val = 1
s_fields = df.schema["state"].dataType.names # ['fld']
df_new = df.withColumn(
    'state', 
    f.struct(*([f.col('state')[c].alias(c) for c in s_fields] + [f.lit(val).alias('a')]))
)
df_new.printSchema()
#root
# |-- state: struct (nullable = false)
# |    |-- fld: integer (nullable = true)
# |    |-- a: integer (nullable = false)

References

I found a way to get the field names from the Struct without naming them manually from this answer.

I see, use `withColumn` to replace the `struct` with a new struct, so copy over the old fields. This works, thanks! I wonder if there is a way to add field to the struct, without having to name all the existing sub fields? — MrCartoonology, Feb 14 '18 at 17:12
@MrCartoonology I found a cleaner way to get the field names. See the update. — pault, Feb 15 '18 at 14:55

score 19 · Answer 2 · answered Feb 05 '21 at 17:00

Use a transformation such as the following:

import pyspark.sql.functions as f

df = df.withColumn(
    "state",
    f.struct(
        f.col("state.*"),
        f.lit(123).alias("a")
    )
)

malthe

AnalysisException: Can only star expand struct data types. Attribute: `ArrayBuffer(state)`; – Blue Clouds Aug 09 '23 at 21:12
@BlueClouds what's your dataframe schema (specifically, what's the column type for `state` in this case)? – malthe Aug 25 '23 at 07:49
state is a struct – Blue Clouds Aug 25 '23 at 14:27
Are you sure about that? The error message suggests it's not a struct data type. What happens if you print the schema? – malthe Aug 26 '23 at 21:49

score 5 · Answer 3 · answered Jul 26 '18 at 14:44

This answer is late, but the following works in PySpark 2.x.

It assumes dfOld already contains state and fld as described in the question.

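from pyspark.sql.functions import col, lit, struct

# add the new value as a top-level column, then rebuild the struct around it
dfOld = dfOld.withColumn("a", lit("value"))
dfNew = dfOld.select("level1Field1", "level1Field2", struct(col("state.fld").alias("fld"), col("a")).alias("state"))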

Reference: https://medium.com/@mrpowers/adding-structtype-columns-to-spark-dataframes-b44125409803

score 2 · Answer 4 · answered Jul 15 '22 at 14:34

Spark 3.1+

F.col('state').withField('a', F.lit(1))

Example:

from pyspark.sql import functions as F
df = spark.createDataFrame([((1,),)], 'state:struct<fld:int>')
df.printSchema()
# root
#  |-- state: struct (nullable = true)
#  |    |-- fld: integer (nullable = true)

df = df.withColumn('state', F.col('state').withField('a', F.lit(1)))
df.printSchema()
# root
#  |-- state: struct (nullable = true)
#  |    |-- fld: integer (nullable = true)
#  |    |-- a: integer (nullable = false)

TypeError: 'Column' object is not callable – Blue Clouds Aug 09 '23 at 21:10

Clay · Answer 5 · 2020-09-17T12:11:01.577

Here's a way to do it without a udf.

Initialize example dataframe:

nested_df1 = (spark.read.json(sc.parallelize(["""[
        { "state": {"fld": 1} },
        { "state": {"fld": 2}}
    ]"""])))

nested_df1.printSchema()

root
 |-- state: struct (nullable = true)
 |    |-- fld: long (nullable = true)

Spark .read.json imports all integers as long by default. If state.fld has to be an int, you will need to cast it.

from pyspark.sql import functions as F

# cast first, then alias: an alias applied before the cast is lost and the field is named col1
nested_df1 = (nested_df1
    .select( F.struct(F.col("state.fld").cast('int').alias("fld")).alias("state") ))

nested_df1.printSchema()

root
 |-- state: struct (nullable = false)
 |    |-- fld: integer (nullable = true)

nested_df1.show()

+-----+
|state|
+-----+
|  [1]|
|  [2]|
+-----+

Finally

Use .select to get the nested columns you want from the existing struct with the "parent.child" notation, create the new column, then re-wrap the old columns together with the new columns in a struct.

val_a = 3

nested_df2 = (nested_df1
    .select( 
        F.struct(
            F.col("state.fld"), 
            F.lit(val_a).alias("a")
        ).alias("state")
    )
)


nested_df2.printSchema()

root
 |-- state: struct (nullable = false)
 |    |-- fld: integer (nullable = true)
 |    |-- a: integer (nullable = false)

nested_df2.show()

+------+
| state|
+------+
|[1, 3]|
|[2, 3]|
+------+

Flatten if needed with "parent.*".

nested_df2.select("state.*").printSchema()

root
 |-- fld: integer (nullable = true)
 |-- a: integer (nullable = false)

nested_df2.select("state.*").show()

+---+---+
|fld|  a|
+---+---+
|  1|  3|
|  2|  3|
+---+---+

score 1 · Answer 6 · edited Aug 08 '22 at 21:26

You can use the struct function

import pyspark.sql.functions as f

df = df.withColumn(
    "state",
    f.struct(
        f.col("state.fld").alias("fld"),
        f.lit(1).alias("a")
    )
)

edited Aug 08 '22 at 21:26

buddemat

answered Aug 04 '22 at 21:16

Henrique Maia

score -2 · Answer 7 · edited Sep 12 '18 at 00:36

from pyspark.sql.functions import col, lit, struct
from pyspark.sql.types import *

def add_field_in_dataframe(nfield, df, dt):
    # nfield is a dotted path such as "state.a"; the new field is added as a null of type dt
    fields = nfield.split(".")
    print(fields)
    n = len(fields)
    addField = fields[0]
    if n == 1:
        return df.withColumn(addField, lit(None).cast(dt))

    nestedField = ".".join(fields[:-1])
    sfields = df.select(nestedField).schema[fields[-2]].dataType.names
    print(sfields)
    ac = col(nestedField)
    if n == 2:
        nc = struct(*([ac[c].alias(c) for c in sfields] + [lit(None).cast(dt).alias(fields[-1])]))
    else:
        nc = struct(*([ac[c].alias(c) for c in sfields] + [lit(None).cast(dt).alias(fields[-1])])).alias(fields[-2])
    print(nc)
    n = n - 1

    # walk back up the path, rebuilding each enclosing struct around the one below it
    while n > 1:
        print("n: ", n)
        fields = fields[:-1]
        print("fields: ", fields)
        nestedField = ".".join(fields[:-1])
        print("nestedField: ", nestedField)
        sfields = df.select(nestedField).schema[fields[-2]].dataType.names
        print(fields[-1])
        print("sfields: ", sfields)
        sfields = [s for s in sfields if s != fields[-1]]
        print("sfields: ", sfields)
        ac = col(".".join(fields[:-1]))
        if n > 2:
            print(fields[-2])
            nc = struct(*([ac[c].alias(c) for c in sfields] + [nc])).alias(fields[-2])
        else:
            nc = struct(*([ac[c].alias(c) for c in sfields] + [nc]))
        n = n - 1
    return df.withColumn(addField, nc)
