
Trying to figure out how to randomly replace values in a specific column of a PySpark DataFrame with nulls. So changing a dataframe such as this:

| A  | B  |
|----|----|
| 1  | 2  |
| 3  | 4  |
| 5  | 6  |
| 7  | 8  |
| 9  | 10 |
| 11 | 12 |

and randomly changing 25% of the values in column 'B' to null:

| A  | B    |
|----|------|
| 1  | 2    |
| 3  | NULL |
| 5  | 6    |
| 7  | NULL |
| 9  | NULL |
| 11 | 12   |
• Use [`pyspark.sql.functions.rand`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.rand) with [`when`](https://stackoverflow.com/questions/39048229/spark-equivalent-of-if-then-else). If the random value < 0.25, replace it with `null`. Here is an example which is very similar: [Spark dataframe add new column with random data](https://stackoverflow.com/questions/41459138/spark-dataframe-add-new-column-with-random-data). It's not an exact dupe so if that doesn't answer your question, I can post an answer. – pault Sep 18 '20 at 14:14
• See also: [Random numbers generation in PySpark](https://stackoverflow.com/questions/31900124/random-numbers-generation-in-pyspark) – pault Sep 18 '20 at 14:16

1 Answer


Thanks to @pault, I was able to answer my own question using the question he posted: [Spark dataframe add new column with random data](https://stackoverflow.com/questions/41459138/spark-dataframe-add-new-column-with-random-data).

Essentially I ran something like this:

import pyspark.sql.functions as f

# Keep each value with probability ~0.75; otherwise replace it with null
df1 = df.withColumn('B', f.when(f.rand() > 0.25, df['B']).otherwise(f.lit(None)))

This randomly replaces roughly 25% of the values in column 'B' with null.
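
For reference, here is a self-contained sketch of the same approach (a minimal example under a few assumptions: it builds a local SparkSession, recreates the sample data from the question, and passes a seed to `rand` so the run is reproducible):

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)],
    ['A', 'B'],
)

# rand(seed) draws a uniform value in [0, 1) for each row; rows where it
# falls at or below 0.25 get null in column 'B', the rest keep their value.
df1 = df.withColumn(
    'B',
    f.when(f.rand(seed=42) > 0.25, f.col('B')).otherwise(f.lit(None)),
)
df1.show()

Note that this nulls out each row independently with probability 0.25, so you get approximately 25% nulls rather than exactly 25%.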
