2

I have a situation where my dataframe has 3 columns, out of these three columns there is a possibility that there are nulls in column3. The total records in this DF is 2 million. I would need to fill this null values via a value from mysql database(basically calling a function which returns a value). I can think about looping over each row but this could be much more time consuming given the amount of data. How can I achieve this, I know how to fill the nulls with a static value but this is completely dynamic. Thanks for the help Regards, Robin

Python Newbie
  • 45
  • 1
  • 7
  • you want to replace null with one single value return by your function right or i interpret the question incorrectly – Ankit Kumar Namdeo Sep 08 '18 at 14:21
  • The function would return a different value each time its called.This would return a sequence and I would need to assign them to each row. – Python Newbie Sep 08 '18 at 18:48
  • 1
    Please read [how to create good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). – pault Sep 08 '18 at 19:23

2 Answers2

0

If I get your question correctly, you want to have some unique value in a column if there has been a Null value before. One possible method would be the following code, which checks for Null values in the value column. If it finds Null it will use the monotonically_increasing id to replace the Null. In the other case the original value will remain.

test_df = spark.createDataFrame([
('a','2018'),
                  ('b',None),
                  ('c',None)
], ("col_name","value"))

test_df.withColumn("col3", when(col("value").isNull(), row_number()).otherwise(col("value"))).show(truncate=False)

Result:

+--------+-----+------------+
|col_name|value|col3        |
+--------+-----+------------+
|a       |2018 |2018        |
|b       |null |403726925824|
|c       |null |609885356032|
+--------+-----+------------+

PS: For future requests, it would be good if you could include a sample from your data set and your desired output. This often helps to understand the problem.

gaw
  • 1,960
  • 2
  • 14
  • 18
  • Thanks I will try to format and also post an example, your statement helped me find the solution. Although the row_number() didnt work for me( I m using 2.3.0). Probably the row_number() is changed with some lower or uppercase, I tired that as well. I am going to use monotonically_increasing_id() to get incremental values for each row. The last question is I dont want to create the new column but I wanted to update the existing column. – Python Newbie Sep 11 '18 at 09:57
0

For the above test case I would update the value column for only two rows with below command. test_df.withColumn("value", when(col("value").isNull(), monotonically_increasing_id()).otherwise(col("value"))).show(truncate=False)

Thanks for all the comments and help.

Python Newbie
  • 45
  • 1
  • 7