
In Spark 2.3, I want to replace each null value with the value from the row directly above it. An example DataFrame:

|COLUMN |
|-------|
|10     |
|30     |
|null   |
|80     |

The result I want would be:

|COLUMN |
|-------|
|10     |
|30     |
|30     |
|80     |

Thanks.
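A common way to express this in Spark 2.3 is a forward fill over a window: order the rows, then take `last(..., ignoreNulls = true)` over a frame that runs from the start up to the current row. Below is a minimal Scala sketch; the `idx` column is an assumption added purely to fix the row order, since the DataFrame in the question has no ordering column.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last}

val spark = SparkSession.builder().appName("forward-fill").master("local[*]").getOrCreate()
import spark.implicits._

// Sample data from the question, plus a hypothetical "idx" column that defines row order.
val df = Seq(
  (1L, Some(10)), (2L, Some(30)), (3L, None), (4L, Some(80))
).toDF("idx", "COLUMN")

// Frame from the first row up to the current row, ordered by idx.
val w = Window.orderBy("idx").rowsBetween(Window.unboundedPreceding, Window.currentRow)

// last(..., ignoreNulls = true) returns the most recent non-null value, i.e. a forward fill.
val filled = df.withColumn("COLUMN", last(col("COLUMN"), ignoreNulls = true).over(w))
filled.show()
// +---+------+
// |idx|COLUMN|
// +---+------+
// |  1|    10|
// |  2|    30|
// |  3|    30|
// |  4|    80|
// +---+------+
```

Note that `Window.orderBy` without a `partitionBy` moves all rows into a single partition (Spark warns about this), which is the performance concern raised in the comments below.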

  • Possible duplicate of [Spark / Scala: forward fill with last observation](https://stackoverflow.com/questions/33621319/spark-scala-forward-fill-with-last-observation). Note that you usually need a column to sort by since a dataframe does not have a guaranteed order otherwise. – Shaido Aug 23 '19 at 08:04
  • Reading in from file has a guaranteed ordering. That said, zipWithIndex is advisable (a sketch follows these comments). – thebluephantom Aug 23 '19 at 21:27
  • Example unclear. Top of what, 80 or 30? Show more data points. – thebluephantom Aug 23 '19 at 21:30
  • @thebluephantom Copy the value of the cell above the null one. That's why there are two 30s. – Sorul Aug 25 '19 at 06:49
  • That is why I changed the title. Do you still want an answer? Not possible now. Does the duplicate help? Not so easy in terms of performance. – thebluephantom Aug 25 '19 at 08:11
  • @thebluephantom I expected a more intuitive answer than the duplicate. I thought that in Spark 2.3 a simpler way would be possible. I will try to solve it on my own. – Sorul Aug 25 '19 at 08:49
  • The duplicate answer is very good, given by some of the finest. There are some things Spark is less suited for, namely sequential processing. Cannot get much better than zero323. – thebluephantom Aug 25 '19 at 09:05
  • @thebluephantom You're right, but the duplicate is for Spark 1.6 (2016), so I hoped that with a newer version like Spark 2.3 there would be a better way to solve this problem. – Sorul Aug 25 '19 at 09:32
  • No, it is not. You may be missing the point about partitioning. In your example you have a time series of sorts and no major key to scope within and restrict the data. Even zipWithIndex has issues. Range partitioning may help, but for 15 points we are looking at a lot of effort (sketches follow these comments). New versions of Spark do not fundamentally alter the shared-nothing paradigm, partitioning, and shuffling for parallel programming. Edge cases are complicated. I would look at the duplicate, to be honest. – thebluephantom Aug 25 '19 at 09:49
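On the zipWithIndex suggestion above: a hedged sketch of attaching a row index that preserves the order in which the data was read, assuming `df` is the single-column frame from the question and `spark` is the active session. The resulting `idx` column then feeds the same window fill shown under the question.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Attach a stable 0-based index reflecting the order in which the rows were read.
val indexedRdd = df.rdd.zipWithIndex.map { case (row, i) => Row.fromSeq(row.toSeq :+ i) }
val indexedSchema = StructType(df.schema.fields :+ StructField("idx", LongType, nullable = false))
val indexed = spark.createDataFrame(indexedRdd, indexedSchema)
```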
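On the partitioning point: if the data did have a grouping key, the window could be scoped per key and the fill would parallelise instead of collapsing into one partition. A purely hypothetical sketch, where `device` and `keyed` are invented names for illustration:

```scala
// Hypothetical: "keyed" has columns (device, idx, COLUMN); the fill runs independently per device.
val wKeyed = Window
  .partitionBy("device")
  .orderBy("idx")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val filledPerKey = keyed.withColumn("COLUMN", last(col("COLUMN"), ignoreNulls = true).over(wKeyed))
```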

0 Answers