
I have a PySpark DataFrame and need to compute a column whose value depends on the value of the same column in the previous row. However, instead of the previous row's old value, I need the new one, to which the calculation has already been applied.

Specifically, I need to compute the following: Given a DataFrame with a column A in a specific order, compute a column B sequentially as B = MAX(0, LAG(B) - LAG(A)), starting with a default value of 0 for the first row.

Example:

Input:
order | A  
------|----
  0   | -1
  1   | -2
  2   |  4
  3   |  4
  4   | -1
  5   |  4
  6   | -1

Wanted output:
order | A  | B
------|----|---
  0   | -1 | 0 <- B is set to 0 here
  1   | -2 | 1 
  2   |  4 | 3
  3   |  4 | 0
  4   | -1 | 0
  5   |  4 | 1
  6   | -1 | 0
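For clarity, here is the recurrence computed sequentially in plain Python on the example data (`compute_b` is just an illustrative helper, not part of any PySpark API):

```python
def compute_b(a_values):
    """Sequentially compute B = max(0, B_prev - A_prev), with B = 0 for the first row."""
    b_values = []
    b = 0  # default value for the first row
    for a in a_values:
        b_values.append(b)
        b = max(0, b - a)  # this becomes B of the next row
    return b_values

a = [-1, -2, 4, 4, -1, 4, -1]
print(compute_b(a))  # -> [0, 1, 3, 0, 0, 1, 0], matching the wanted output
```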

Using the default F.lag window function does not work, since it only yields the old value from the previous row; anything else would make distributed computing impossible, because the column would have to be computed sequentially.
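To make the problem concrete: a single non-sequential pass of the lag formula, simulated here in plain Python under the assumption that B is seeded with 0, already diverges from the wanted output, because each row only sees the *old* B of its predecessor:

```python
def one_pass(a_values, b_old):
    """Apply B_new[i] = max(0, B_old[i-1] - A[i-1]) using only the old B values,
    which is all a single window/lag pass can see."""
    return [0] + [max(0, b_old[i - 1] - a_values[i - 1])
                  for i in range(1, len(a_values))]

a = [-1, -2, 4, 4, -1, 4, -1]
print(one_pass(a, [0] * len(a)))  # -> [0, 1, 2, 0, 0, 1, 0]; row 2 should be 3
```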

Yannic
  • [This](https://stackoverflow.com/questions/70850077/calculating-column-value-in-current-row-of-spark-dataframe-based-on-the-calculat/70850969#70850969) might help – blackbishop Jun 09 '22 at 16:39
  • Looks like what I am searching for, thanks! But the computation has a lot of redundancy in it and turns from a linear runtime into a quadratic one, which might be a problem for big dataframes. – Yannic Jun 10 '22 at 07:05

0 Answers