In Spark, a literal column added with F.lit is not nullable:

from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1,)], ['c1'])

df = df.withColumn('c2', F.lit('a'))

df.printSchema()
#  root
#   |-- c1: long (nullable = true)
#   |-- c2: string (nullable = false)

How can I create a nullable literal column?

ZygD
  • the real question is "why would you need to have a lit null column ... – Steven Jul 29 '21 at 15:04
  • In my case, I needed to create a schema identical to another dataframe's, but with different data in it. This includes nullability. – ZygD Jul 29 '21 at 15:07
  • Why don't you create the schema directly then, starting from df.schema? – Steven Jul 29 '21 at 15:08
  • I think you should create another question with your original use case. Currently, you are trying to find help for a solution you imagined, but it is probably not the proper one. This is called the [XY_problem](https://en.wikipedia.org/wiki/XY_problem) – Steven Jul 29 '21 at 15:15
  • Sorry, but if you cannot imagine a use case, it does not mean it does not exist. I had this issue and found an answer [here](https://dev.to/kevinwallimann/how-to-make-a-column-non-nullable-in-spark-structured-streaming-4b62). Then I made it even better, so I decided to post it here. Later I read more and found [this highly-upvoted answer](https://stackoverflow.com/questions/33193958#46119565) to a different problem. I think those cases prove that the use case exists, whatever original issue people may have. I just wanted to help someone else find the answer more easily. – ZygD Jul 29 '21 at 18:56

1 Answer

The shortest method I've found is to wrap the literal in when (the otherwise clause does not seem to be needed):

df = df.withColumn('c2', F.when(F.lit(True), F.lit('a')))

In Scala: .withColumn("c2", when(lit(true), lit("a")))


Full test result:

from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1,)], ['c1'])
df = df.withColumn('c2', F.when(F.lit(True), F.lit('a')))

df.show()
#  +---+---+
#  | c1| c2|
#  +---+---+
#  |  1|  a|
#  +---+---+

df.printSchema()
#  root
#   |-- c1: long (nullable = true)
#   |-- c2: string (nullable = true)
ZygD