
I have a DataFrame in PySpark like below.

df.show()
+---+----+
| id|name|
+---+----+
|  1| sam|
|  2| Tim|
|  3| Jim|
|  4| sam|
+---+----+

Now I have added a new column to the df like below:

from pyspark.sql.functions import lit
from pyspark.sql.types import StringType
new_df = df.withColumn('new_column', lit(None).cast(StringType()))

Now when I query the new_df:

new_df.show()
+---+----+----------+
| id|name|new_column|
+---+----+----------+
|  1| sam|      null|
|  2| Tim|      null|
|  3| Jim|      null|
|  4| sam|      null|
+---+----+----------+

Now I want to update the value in new_column based on a condition.

I am trying to write the below condition but am unable to do so:

if name is "sam", then new_column should be "tested", else "not_tested"

if name == "sam":
    new_column = "tested"
else:
    new_column = "not_tested"

How can I achieve this in PySpark?

Edit: I am not looking for an if-else statement, but for how to update the values of records in a PySpark column.


1 Answer


@user9367133 Thanks for reaching out. If you follow my answer on the similar question you pointed to, it's pretty much the same logic:

from pyspark.sql.functions import when

# Drop the null placeholder column, then rebuild it from the condition.
new_df\
    .drop(new_df.new_column)\
    .withColumn('new_column', when(new_df.name == 'sam', 'tested').otherwise('not_tested'))\
    .show()
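
With the sample data from your question, this should produce:

+---+----+----------+
| id|name|new_column|
+---+----+----------+
|  1| sam|    tested|
|  2| Tim|not_tested|
|  3| Jim|not_tested|
|  4| sam|    tested|
+---+----+----------+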

You don't necessarily have to add new_column beforehand as null if you are just going to replace it with proper values immediately; you can build the column directly from the condition, as in the sketch below. But I wasn't sure about your use case, so I kept it in my example.
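
A minimal sketch of that direct approach, assuming the original df from your question (columns id and name):

from pyspark.sql.functions import when

# Build new_column in one step: no null placeholder needed, since
# when/otherwise assigns a value to every row.
df.withColumn('new_column', when(df.name == 'sam', 'tested').otherwise('not_tested')).show()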

Hope this helps, cheers!
