How to create new string column in PySpark DataFrame based on values of other columns?

Question

I have a PySpark dataframe that has a couple of fields, e.g.:

Id	Name	Surname
1	John	Johnson
2	Anna	Maria

I want to create a new column that combines the values of other columns into a new string. Desired output is:

Id	Name	Surname	New
1	John	Johnson	Hey there John Johnson!
2	Anna	Maria	Hey there Anna Maria!

I'm trying to do (pseudocode):

df = df.withColumn("New", "Hey there " + Name + " " + Surname + "!")

How can this be achieved?

wrap the literal values in `lit()` and the column names in `col()`. concatenation can be done using `concat()`. see [func doc](https://spark.apache.org/docs/3.3.0/api/python/reference/pyspark.sql/functions.html) for more details. — samkart, Aug 03 '22 at 16:51

score 3 · Accepted Answer · answered Aug 03 '22 at 16:51

You can use either the `concat` function or `format_string`. With `format_string`:

from pyspark.sql import functions as F

df = df.withColumn(
    "New",
    # printf-style format: each %s is filled from the named column, in order
    F.format_string("Hey there %s %s!", "Name", "Surname")
)

df.show(truncate=False)
# +---+----+-------+-----------------------+
# |Id |Name|Surname|New                    |
# +---+----+-------+-----------------------+
# |1  |John|Johnson|Hey there John Johnson!|
# |2  |Anna|Maria  |Hey there Anna Maria!  |
# +---+----+-------+-----------------------+

If you prefer using concat:

df = df.withColumn("New", F.concat(F.lit("Hey there "), F.col("Name"), F.lit(" "), F.col("Surname"), F.lit("!")))
