
Below is the input dataframe.

+-----------+---+------+----+----+
|DATE       |ID |sal   |vat |flag|
+-----------+---+------+----+----+
|10-may-2022|1  |1000.0|12.0|1   |
|12-may-2022|2  |50.0  |6.0 |1   |
+-----------+---+------+----+----+
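
For reference, the sample data could be recreated like this (a minimal sketch; the column names and types are assumed from the table above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows matching the table above (types inferred: strings, ints, doubles)
srcdf = spark.createDataFrame(
    [("10-may-2022", 1, 1000.0, 12.0, 1),
     ("12-may-2022", 2, 50.0, 6.0, 1)],
    ["DATE", "ID", "sal", "vat", "flag"],
)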

I want to perform the below based on the flag column.

If the flag column is 1, I will do the below.

df = srcdf.withColumn("sum",col("sal")*2)
display(df)

If the flag column is 2, I will do the below.

df = srcdf.withColumn("sum",col("sal")*4)
display(df)

Below is the code I'm using.

flag = srcdf.select(col("flag"))

if flag == 1:
    df = srcdf.withColumn("sum", col("sal") * 2)
    display(df)
else:
    df = srcdf.withColumn("sum", col("sal") * 4)
    display(df)

When I use the above, I get a syntax error. Is there any other way I can achieve this using PySpark conditional statements?

Thank you.

SanjanaSanju

1 Answer


Possible duplicate of this question.

You need to use when (with or without otherwise) from pyspark.sql.functions.

from pyspark.sql.functions import when, col
df = srcdf\
   .withColumn("sum", when(col("flag") == 1, col("sal") * 2)\
                     .when(col("flag") == 2, col("sal") * 4)
   )
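
Note that without otherwise, any row whose flag is neither 1 nor 2 will end up with null in the sum column.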

OR

from pyspark.sql.functions import when, col
df = srcdf\
   .withColumn("sum", when(col("flag") == 1, col("sal") * 2)\
                     .otherwise(col("sal") * 4)
   )
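
If the flag value is the same for every row and you really want to branch in plain Python on the driver (as in your original attempt), you can pull a single value out of the DataFrame first. A minimal sketch, assuming the flag is constant across the whole DataFrame:

from pyspark.sql.functions import col

# Bring one flag value back to the driver (assumes it is the same for all rows)
flag_value = srcdf.select("flag").first()["flag"]

if flag_value == 1:
    df = srcdf.withColumn("sum", col("sal") * 2)
else:
    df = srcdf.withColumn("sum", col("sal") * 4)

This only makes sense when the flag really is constant per DataFrame; for per-row logic, the when/otherwise version above is the right approach.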
Rehan Rajput