Fill null columns dynamically in dataframe using pyspark

Question

I have a DataFrame with 3 columns, and column3 may contain nulls. The DataFrame has 2 million records in total. I need to fill these nulls with values from a MySQL database (basically by calling a function that returns a value). I could loop over each row, but that would be very time consuming given the amount of data. How can I achieve this? I know how to fill nulls with a static value, but here the value is completely dynamic. Thanks for the help. Regards, Robin

Do you want to replace the nulls with one single value returned by your function, or am I interpreting the question incorrectly? — Ankit Kumar Namdeo, Sep 08 '18 at 14:21
The function returns a different value each time it is called. It returns a sequence, and I need to assign those values to the rows. — Python Newbie, Sep 08 '18 at 18:48
Please read [how to create good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). — pault, Sep 08 '18 at 19:23

score 0 · Accepted Answer · answered Sep 10 '18 at 13:12

If I understand your question correctly, you want a unique value in a column wherever it currently holds a null. One possible method is the following code, which checks for nulls in the value column. Where it finds a null, it uses monotonically_increasing_id() to replace it; otherwise the original value is kept.

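from pyspark.sql.functions import col, when, monotonically_increasing_id

test_df = spark.createDataFrame([
    ('a', '2018'),
    ('b', None),
    ('c', None)
], ("col_name", "value"))

# row_number() only works with a window (.over(...)); monotonically_increasing_id()
# needs none, and the IDs in the result below have the shape that function produces.
test_df.withColumn("col3", when(col("value").isNull(), monotonically_increasing_id()).otherwise(col("value"))).show(truncate=False)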

Result:

+--------+-----+------------+
|col_name|value|col3        |
+--------+-----+------------+
|a       |2018 |2018        |
|b       |null |403726925824|
|c       |null |609885356032|
+--------+-----+------------+
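The large numbers in col3 are not random. The Spark documentation describes the current implementation of monotonically_increasing_id() as putting the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. A minimal sketch in plain Python that takes the two IDs above apart (decode_id is a hypothetical helper written for this illustration, not part of PySpark):

```python
# Layout described in the Spark docs for monotonically_increasing_id():
# partition ID in the upper 31 bits, record number in the lower 33 bits.
RECORD_BITS = 33


def decode_id(generated_id):
    """Split a generated ID into (partition_id, record_number)."""
    partition_id = generated_id >> RECORD_BITS
    record_number = generated_id & ((1 << RECORD_BITS) - 1)
    return partition_id, record_number


for generated_id in (403726925824, 609885356032):
    print(generated_id, decode_id(generated_id))
# 403726925824 (47, 0)
# 609885356032 (71, 0)
```

So rows b and c were each the first record of their partition (47 and 71 in this run). The IDs are unique and increasing, but not consecutive, and they can differ between runs if the partitioning changes.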

PS: For future requests, it would be good if you could include a sample from your data set and your desired output. This often helps to understand the problem.

Thanks, I will try to format the question and post an example. Your answer helped me find the solution, although row_number() didn't work for me (I'm using 2.3.0). I thought the name might differ in lower or upper case and tried that as well. I am going to use monotonically_increasing_id() to get incremental values for each row. One last question: I don't want to create a new column; I want to update the existing column. — Python Newbie, Sep 11 '18 at 09:57

score 0 · Answer 2 · answered Sep 11 '18 at 10:26

For the test case above, the following command updates the existing value column in place, changing only the two rows where it is null:

test_df.withColumn("value", when(col("value").isNull(), monotonically_increasing_id()).otherwise(col("value"))).show(truncate=False)

Thanks for all the comments and help.
