95

As noted in many other places on the web, adding a new column to an existing DataFrame is not straightforward. Unfortunately this functionality is important to have (even though it is inefficient in a distributed environment), especially when trying to concatenate two DataFrames using unionAll.

What is the most elegant workaround for adding a null column to a DataFrame to facilitate a unionAll?

My version goes like this:

from pyspark.sql.types import StringType
from pyspark.sql.functions import UserDefinedFunction
to_none = UserDefinedFunction(lambda x: None, StringType())
new_df = old_df.withColumn('new_column', to_none(old_df['any_col_from_old']))
zero323
architectonic

5 Answers

185

All you need here is to import StringType and to use lit and cast:

from pyspark.sql.types import StringType
from pyspark.sql.functions import lit

new_df = old_df.withColumn('new_column', lit(None).cast(StringType()))

A full example:

from pyspark.sql import Row

row = Row("foo", "bar")  # a Row template with named fields
df = sc.parallelize([row(1, "2"), row(2, "3")]).toDF()
df.printSchema()
# root
#  |-- foo: long (nullable = true)
#  |-- bar: string (nullable = true)

new_df = df.withColumn('new_column', lit(None).cast(StringType()))

new_df.printSchema()
# root
#  |-- foo: long (nullable = true)
#  |-- bar: string (nullable = true)
#  |-- new_column: string (nullable = true)

new_df.show()
# +---+---+----------+
# |foo|bar|new_column|
# +---+---+----------+
# |  1|  2|      null|
# |  2|  3|      null|
# +---+---+----------+

A Scala equivalent can be found here: Create new Dataframe with empty/null field values

ZygD
zero323
  • How to do this conditionally, if the column does not exist in the first place? I am trying to use UDF and pass the DF to it and then doing `new_column not in df.columns` check, but can't make it to work. – Gopala Jul 12 '16 at 20:09
  • I looked at it too, but I was still not able to incorporate that conditionally into a `withColumn('blah', where(has_column(df['blah']) == False)....` kind of construct. Must be missing some syntactical construct. I want to add a column with Nulls, if it does not exist. This answer does former, the other one checks latter. – Gopala Jul 12 '16 at 21:57
  • @Gopala `df if has_column(df) else df.withColumn(....)` - nothing Spark specific. – zero323 Jul 12 '16 at 21:58
  • 1
    Darn....I get really mixed up with when python syntax works, and when it does not. For example, you can't have conditional code within the `withColumn` and have to use UDFs. Thank you! – Gopala Jul 12 '16 at 22:00
12

I would cast lit(None) to NullType instead of StringType. That way, if we ever have to filter out non-null rows on that column, it can easily be done as follows:

from pyspark.sql import Row
from pyspark.sql.functions import col, lit
from pyspark.sql.types import NullType

df = sc.parallelize([Row(1, "2"), Row(2, "3")]).toDF()

new_df = df.withColumn('new_column', lit(None).cast(NullType()))

new_df.printSchema() 

df_null = new_df.filter(col("new_column").isNull()).show()
df_non_null = new_df.filter(col("new_column").isNotNull()).show()

Also be careful not to use lit("None") (with quotes) if you are casting to StringType, since searching for records with the filter condition .isNull() on col("new_column") would fail.

Shrikant Prabhu
  • 2
    Error: `Parquet data source does not support null data type.;`. `StringType()` worked. – ZygD Feb 15 '21 at 14:42
  • This could be a very bad idea depending on your usecase, as it casts the column as a Void type, and thus nothing can be inserted other than null if you write this out to some kind of storage, e.g. delta format. – Chris Ivan Mar 29 '23 at 07:40
4

The option without importing StringType:

df = df.withColumn('foo', F.lit(None).cast('string'))

Full example:

from pyspark.sql import functions as F
df = spark.range(1, 3).toDF('c')

df = df.withColumn('foo', F.lit(None).cast('string'))

df.printSchema()
#     root
#      |-- c: long (nullable = false)
#      |-- foo: string (nullable = true)

df.show()
#     +---+----+
#     |  c| foo|
#     +---+----+
#     |  1|null|
#     |  2|null|
#     +---+----+
ZygD
1
df1.selectExpr("school","null as col1").show()

output:

+--------------------+----+
|              school|col1|
+--------------------+----+
|Shanghai Jiao Ton...|null|
|   Peking University|null|
|Shanghai Jiao Ton...|null|
|    Fudan University|null|
|    Fudan University|null|
| Tsinghua University|null|
|Shanghai Jiao Ton...|null|
| Tsinghua University|null|
| Tsinghua University|null|
|   Peking University|null|
+--------------------+----+

or, with the pandas API on Spark (pyspark 3.2+):

df1.pandas_api().assign(new_column=None)
G.G
0

Why not just use this?

from pyspark.sql.functions import lit
df.withColumn("column_name", lit("")).show()
Underoos