I have a situation where my dataframe has 3 columns, out of these three columns there is a possibility that there are nulls in column3. The total records in this DF is 2 million. I would need to fill this null values via a value from mysql database(basically calling a function which returns a value). I can think about looping over each row but this could be much more time consuming given the amount of data. How can I achieve this, I know how to fill the nulls with a static value but this is completely dynamic. Thanks for the help Regards, Robin
-
you want to replace null with one single value return by your function right or i interpret the question incorrectly – Ankit Kumar Namdeo Sep 08 '18 at 14:21
-
The function would return a different value each time its called.This would return a sequence and I would need to assign them to each row. – Python Newbie Sep 08 '18 at 18:48
-
1Please read [how to create good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). – pault Sep 08 '18 at 19:23
2 Answers
If I get your question correctly, you want to have some unique value in a column if there has been a Null
value before. One possible method would be the following code, which checks for Null
values in the value
column. If it finds Null
it will use the monotonically_increasing id to replace the Null
. In the other case the original value will remain.
test_df = spark.createDataFrame([
('a','2018'),
('b',None),
('c',None)
], ("col_name","value"))
test_df.withColumn("col3", when(col("value").isNull(), row_number()).otherwise(col("value"))).show(truncate=False)
Result:
+--------+-----+------------+
|col_name|value|col3 |
+--------+-----+------------+
|a |2018 |2018 |
|b |null |403726925824|
|c |null |609885356032|
+--------+-----+------------+
PS: For future requests, it would be good if you could include a sample from your data set and your desired output. This often helps to understand the problem.

- 1,960
- 2
- 14
- 18
-
Thanks I will try to format and also post an example, your statement helped me find the solution. Although the row_number() didnt work for me( I m using 2.3.0). Probably the row_number() is changed with some lower or uppercase, I tired that as well. I am going to use monotonically_increasing_id() to get incremental values for each row. The last question is I dont want to create the new column but I wanted to update the existing column. – Python Newbie Sep 11 '18 at 09:57
For the above test case I would update the value column for only two rows with below command. test_df.withColumn("value", when(col("value").isNull(), monotonically_increasing_id()).otherwise(col("value"))).show(truncate=False)
Thanks for all the comments and help.

- 45
- 1
- 7