39

I want to do something like this:

df.replace('empty-value', None, 'NAME')

Basically, I want to replace some value with NULL, but it does not accept None as an argument. How can I do this?

talloaktrees

5 Answers

76

You can combine a when clause with a NULL literal and type casting as follows:

from pyspark.sql.functions import when, lit, col

df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["x", "y"])

# Keep the original value unless it equals value; otherwise yield NULL.
def replace(column, value):
    return when(column != value, column).otherwise(lit(None))

df.withColumn("y", replace(col("y"), "bar")).show()
## +---+----+
## |  x|   y|
## +---+----+
## |  1| foo|
## |  2|null|
## +---+----+

It doesn't introduce BatchPythonEvaluation, and because of that it should be significantly more efficient than using a UDF.
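Applied to the column from the question, the helper would be used like this (just a sketch; NAME is the column name taken from the question):

df.withColumn("NAME", replace(col("NAME"), "empty-value"))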

zero323
9

This will replace empty-value with None in your name column:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


df = sc.parallelize([(1, "empty-value"), (2, "something else")]).toDF(["key", "name"])
new_column_udf = udf(lambda name: None if name == "empty-value" else name, StringType())
new_df = df.withColumn("name", new_column_udf(df.name))
new_df.collect()

Output:

[Row(key=1, name=None), Row(key=2, name=u'something else')]

Because the old name is used as the first parameter of withColumn, the old name column is actually replaced with the new one generated by the UDF.
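For contrast, passing a name that doesn't exist yet (the name name_cleaned here is purely illustrative) makes withColumn append a new column and keep the original:

new_df = df.withColumn("name_cleaned", new_column_udf(df.name))
new_df.collect()

Output:

[Row(key=1, name=u'empty-value', name_cleaned=None), Row(key=2, name=u'something else', name_cleaned=u'something else')]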

Daniel Zolnai
  • I didn't think of trying UDFs, that seems to be the way to go – talloaktrees Apr 27 '16 at 21:02
  • 2
    Your code will run faster if you use native Spark functions rather than UDFs - see the other answers. (That's why they have more upvotes than the accepted answer) – RobinL Jan 15 '20 at 11:26
8

You could also simply use a dict as the first argument of replace. I tried it, and this seems to accept None as an argument.

df = df.replace({'empty-value':None}, subset=['NAME'])

Note that your 'empty-value' needs to be hashable.
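The dict form also extends naturally to mapping several values at once; a sketch, where the second placeholder 'N/A' is purely illustrative:

df = df.replace({'empty-value': None, 'N/A': None}, subset=['NAME'])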

Willem
5

The best alternative is to use a when combined with a NULL literal. Example:

from pyspark.sql.functions import when, lit, col

df = df.withColumn('foo', when(col('foo') != 'empty-value', col('foo')))

If you want to replace several values with null, you can either use | inside the when condition or the powerful create_map function.
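A minimal sketch of the | variant (the second placeholder 'N/A' is purely illustrative):

from pyspark.sql.functions import when, lit, col

df = df.withColumn(
    'foo',
    when((col('foo') == 'empty-value') | (col('foo') == 'N/A'), lit(None))
    .otherwise(col('foo'))
)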

It is important to note that the worst way to solve this is with a UDF. UDFs provide great versatility to your code, but they come with a huge performance penalty.

Chp11
0

This will also work:

df5 = df.replace('empty-value', None)
