I am new to PySpark and have been trying a few things.

I have a DataFrame as follows:

+----------+-----------+
|   Column1|    Column2|
+----------+-----------+
|    VALUE1|      30000|
|    VALUE2|      25000|
|    VALUE3|      20000|
|    VALUE4|      19500|
|    VALUE5|      18100|
+----------+-----------+

I want to add a new column whose value is given by the following formula:

CurrentRow[Column3] = 
    IF (CurrentRow[Column2] > PreviousRow[Column3]) 
    THEN PreviousRow[Column3]
    ELSE CurrentRow[Column2] * 0.9

Expected output:

+----------+------------------+------------------+
|   Column1|           Column2|           Column3|
+----------+------------------+------------------+
|    VALUE1|             30000|             27000|
|    VALUE2|             25000|             22500|
|    VALUE3|             20000|             18000|
|    VALUE4|             19500|             18000|
|    VALUE5|             18100|             18000|
+----------+------------------+------------------+

I tried using the lag function on the same column that is being created (with withColumn), but could not get it to work.
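To make the rule concrete, here is a minimal plain-Python sketch of the formula above, applied to the Column2 values in order. The function name `compute_column3` and the `factor` parameter are hypothetical names chosen for illustration. The handling of the first row (no previous row, so Column2 * 0.9) is an assumption inferred from the expected output, since the formula itself does not cover it.

```python
from typing import List, Optional


def compute_column3(column2: List[float], factor: float = 0.9) -> List[float]:
    """Apply the row-by-row rule from the question to ordered Column2 values.

    Hypothetical helper for illustration; not part of PySpark.
    """
    result: List[float] = []
    previous: Optional[float] = None  # PreviousRow[Column3]; None on the first row
    for current in column2:
        if previous is not None and current > previous:
            # CurrentRow[Column2] > PreviousRow[Column3]: carry the previous value
            value = previous
        else:
            # First row (assumption) or Column2 <= previous Column3
            value = current * factor
        result.append(value)
        previous = value  # each row depends on the value just computed
    return result


column2 = [30000, 25000, 20000, 19500, 18100]
column3 = compute_column3(column2)
print(column3)  # matches the Column3 values in the expected output table
```

Each value depends on the Column3 value computed for the row before it, which is why a lag over an existing column cannot express this directly: lag only sees columns that already exist. One possible way to run such a loop on a DataFrame is to wrap it in a function passed to `groupBy(...).applyInPandas(...)` (available in PySpark 3.0+), with the caveat that all rows of a group are then processed on a single executor.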

SamaAdi
  • Did you investigate the usage of the pyspark lag function? Can you post any code you have tried already? – Alexandre Juma May 19 '22 at 12:21
  • @AlexandreJuma The major challenge I am facing is that I cannot access Column3 while it is being created, which I guess is not possible using DataFrames. – SamaAdi May 19 '22 at 14:11
  • @AlexandreJuma I tried using the following code but got the wrong result. `w = Window.orderBy(F.col('Column2').desc())` `df = df.withColumn("temp_column", F.col('Column2') * F.lit(0.9))` `df.withColumn("Column3", F.when(F.col('Column2') >= F.lit(0.9) * F.lag('Column2', 1).over(w), F.lag('temp_column', 1).over(w)).otherwise(0.9 * F.col('Column2'))).show()` – SamaAdi May 19 '22 at 14:22