My goal is to replace all negative elements in a column of a PySpark DataFrame with zero.
Input data:
+------+
| col1 |
+------+
| -2 |
| 1 |
| 3 |
| 0 |
| 2 |
| -7 |
| -14 |
| 3 |
+------+
Desired output data:
+------+
| col1 |
+------+
| 0 |
| 1 |
| 3 |
| 0 |
| 2 |
| 0 |
| 0 |
| 3 |
+------+
Basically, I can do this with when/otherwise:
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
df = df.withColumn('col1', F.when(F.col('col1') < 0, 0).otherwise(F.col('col1')))
or a UDF can be defined:
smooth = F.udf(lambda x: x if x > 0 else 0, IntegerType())
df = df.withColumn('col1', smooth(F.col('col1')))
or
# note: the division by 2 turns the column into a double
df = df.withColumn('col1', (F.col('col1') + F.abs(F.col('col1'))) / 2)
or
df = df.withColumn('col1', F.greatest(F.col('col1'), F.lit(0)))
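As a sanity check on the semantics (not the performance), here is a plain-Python sketch of the four expressions applied to the sample values above. No Spark is involved and the helper names are made up for illustration; it only checks that the variants agree on this data, and that the arithmetic one produces floats because of the division.

```python
# Plain-Python stand-ins for the four column expressions (illustrative only).
data = [-2, 1, 3, 0, 2, -7, -14, 3]
expected = [0, 1, 3, 0, 2, 0, 0, 3]


def clamp_when(x):
    # mirrors F.when(col < 0, 0).otherwise(col)
    return 0 if x < 0 else x


def clamp_udf(x):
    # mirrors the lambda passed to F.udf
    return x if x > 0 else 0


def clamp_arith(x):
    # mirrors (col + abs(col)) / 2; true division returns a float
    return (x + abs(x)) / 2


def clamp_greatest(x):
    # mirrors F.greatest(col, F.lit(0))
    return max(x, 0)


variants = {
    "when": clamp_when,
    "udf": clamp_udf,
    "arith": clamp_arith,
    "greatest": clamp_greatest,
}

# Apply each variant to the sample column.
results = {name: [fn(x) for x in data] for name, fn in variants.items()}

for name, values in results.items():
    print(name, values, values == expected)
```

So the variants are interchangeable on this data as far as the values go, but the arithmetic one would need a cast back to an integer type if the column type matters.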
My question is: which of these is the most efficient? A UDF is opaque to Spark's optimizer, so it is clearly not the right way to do this. But I don't know how to go about comparing the other three approaches. One answer would of course be to run experiments and compare mean running times and so on, but I would like to compare these approaches (and any new ones) theoretically.
Thanks in advance...