You can use withColumn
to overwrite the existing new_student_id
column, keeping the original new_student_id
value when it is not null and falling back to the value from the student_id
column otherwise:
from pyspark.sql.functions import col, when

# Create sample data
students = sc.parallelize([(123, 'B', 234), (555, 'A', None)]) \
    .toDF(['student_id', 'grade', 'new_student_id'])

# Fall back to student_id when new_student_id is not populated
cleaned = students.withColumn(
    "new_student_id",
    when(col("new_student_id").isNull(), col("student_id"))
    .otherwise(col("new_student_id"))
)
cleaned.show()
Using your sample data as input:
+----------+-----+--------------+
|student_id|grade|new_student_id|
+----------+-----+--------------+
| 123| B| 234|
| 555| A| null|
+----------+-----+--------------+
the output data looks as follows:
+----------+-----+--------------+
|student_id|grade|new_student_id|
+----------+-----+--------------+
| 123| B| 234|
| 555| A| 555|
+----------+-----+--------------+