95

As noted in many other places on the web, adding a new column to an existing DataFrame is not straightforward. Unfortunately this functionality is important to have (even though it is inefficient in a distributed environment), especially when trying to concatenate two DataFrames using unionAll.

What is the most elegant workaround for adding a null column to a DataFrame to facilitate a unionAll?

My version goes like this:

from pyspark.sql.types import StringType
from pyspark.sql.functions import UserDefinedFunction
to_none = UserDefinedFunction(lambda x: None, StringType())
new_df = old_df.withColumn('new_column', to_none(old_df['any_col_from_old']))
zero323
architectonic

5 Answers

185

All you need here is to import StringType and to use lit and cast:

from pyspark.sql.types import StringType
from pyspark.sql.functions import lit

new_df = old_df.withColumn('new_column', lit(None).cast(StringType()))

A full example:

from pyspark.sql import Row

row = Row("foo", "bar")  # a Row template with named fields
df = sc.parallelize([row(1, "2"), row(2, "3")]).toDF()
df.printSchema()
# root
#  |-- foo: long (nullable = true)
#  |-- bar: string (nullable = true)

new_df = df.withColumn('new_column', lit(None).cast(StringType()))

new_df.printSchema()
# root
#  |-- foo: long (nullable = true)
#  |-- bar: string (nullable = true)
#  |-- new_column: string (nullable = true)

new_df.show()
# +---+---+----------+
# |foo|bar|new_column|
# +---+---+----------+
# |  1|  2|      null|
# |  2|  3|      null|
# +---+---+----------+

A Scala equivalent can be found here: Create new Dataframe with empty/null field values

ZygD
zero323
  • How to do this conditionally, if the column does not exist in the first place? I am trying to use UDF and pass the DF to it and then doing `new_column not in df.columns` check, but can't make it to work. – Gopala Jul 12 '16 at 20:09
  • I looked at it too, but I was still not able to incorporate that conditionally into a `withColumn('blah', where(has_column(df['blah']) == False)....` kind of construct. Must be missing some syntactical construct. I want to add a column with Nulls, if it does not exist. This answer does former, the other one checks latter. – Gopala Jul 12 '16 at 21:57
  • @Gopala `df if has_column(df) else df.withColumn(....)` - nothing Spark specific. – zero323 Jul 12 '16 at 21:58
  • 1
    Darn....I get really mixed up with when python syntax works, and when it does not. For example, you can't have conditional code within the `withColumn` and have to use UDFs. Thank you! – Gopala Jul 12 '16 at 22:00
12

I would cast lit(None) to NullType instead of StringType. That way, if we ever have to filter out non-null rows on that column, it can easily be done as follows:

from pyspark.sql import Row
from pyspark.sql.functions import col, lit
from pyspark.sql.types import NullType

df = sc.parallelize([Row(1, "2"), Row(2, "3")]).toDF()

new_df = df.withColumn('new_column', lit(None).cast(NullType()))

new_df.printSchema() 

df_null = new_df.filter(col("new_column").isNull()).show()
df_non_null = new_df.filter(col("new_column").isNotNull()).show()

Also be careful not to use lit("None") (with quotes) if you are casting to StringType, since searching for records with the filter condition .isNull() on col("new_column") would fail.

Shrikant Prabhu
  • 2
    Error: `Parquet data source does not support null data type.;`. `StringType()` worked. – ZygD Feb 15 '21 at 14:42
  • This could be a very bad idea depending on your usecase, as it casts the column as a Void type, and thus nothing can be inserted other than null if you write this out to some kind of storage, e.g. delta format. – Chris Ivan Mar 29 '23 at 07:40
4

The option without importing StringType:

df = df.withColumn('foo', F.lit(None).cast('string'))

Full example:

from pyspark.sql import functions as F
df = spark.range(1, 3).toDF('c')

df = df.withColumn('foo', F.lit(None).cast('string'))

df.printSchema()
#     root
#      |-- c: long (nullable = false)
#      |-- foo: string (nullable = true)

df.show()
#     +---+----+
#     |  c| foo|
#     +---+----+
#     |  1|null|
#     |  2|null|
#     +---+----+
ZygD
1
df1.selectExpr("school","null as col1").show()

output:

+--------------------+----+
|              school|col1|
+--------------------+----+
|Shanghai Jiao Ton...|null|
|   Peking University|null|
|Shanghai Jiao Ton...|null|
|    Fudan University|null|
|    Fudan University|null|
| Tsinghua University|null|
|Shanghai Jiao Ton...|null|
| Tsinghua University|null|
| Tsinghua University|null|
|   Peking University|null|
+--------------------+----+

or, with the pandas API on Spark (pyspark 3.2+):

df1.pandas_api().assign(new_column=None)
G.G
0

Why not just use this?

from pyspark.sql.functions import lit
df.withColumn("column_name", lit("")).show()
Underoos