0

I have a table like this.

Id   prod   val                   
1    0       0         
2    0       0         
3    1       1000         
4    0       0         
5    1       2000         
6    0       0          
7    0       0         

I want to add a new column new_val and the condition for this column is, if prod = 0, then new_val should be from the preceding row where prod = 1.
If prod = 1 it should have the same value as val column. How do I achieve this using spark sql?

Id   prod   val       new_val                 
1    0       0        1000            
2    0       0        1000             
3    1       1000     1000                
4    0       0        2000                         
5    1       2000     2000               
6    1       4000     4000             
7    1       3000     3000       

Any help is greatly appreciated

ernest_k
  • 44,416
  • 5
  • 53
  • 99
Sujatha
  • 21
  • 1
  • 7
  • you may check this [post](https://stackoverflow.com/questions/43773520/pyspark-how-to-backfill-a-dataframe) – Vivek Atal Apr 08 '20 at 17:07

2 Answers2

0

You can use something like this:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window().orderBy("id")

df = df.withColumn("new_val", F.when(F.col("prod") == 0, F.lag("val").over(w)).otherwise(F.col("val")))

What we are basically doing is using an if-else condition:

When prod == 0, take lag of val which is value of previous row (over a window that is ordered by id column), and if prod == 1, then we use the present value of the column.

pissall
  • 7,109
  • 2
  • 25
  • 45
0

You can acheive that with

val w = Window.orderBy("id").rowsBetween(0, Window.unboundedFollowing)
df
  .withColumn("new_val", when($"prod" === 0, null).otherwise($"val"))
  .withColumn("new_val", first("new_val", ignoreNulls = true).over(w))

It first, creates new column with null values whenever value doesn't change:

+---+----+----+-------+
| id|prod| val|new_val|
+---+----+----+-------+
|  1|   0|   0|   null|
|  2|   0|   0|   null|
|  3|   1|1000|   1000|
|  4|   0|   0|   null|
|  5|   1|2000|   2000|
|  6|   1|4000|   4000|
|  7|   1|3000|   3000|
+---+----+----+-------+

And it replaces values with first non-null value in the following records

+---+----+----+-------+
| id|prod| val|new_val|
+---+----+----+-------+
|  1|   0|   0|   1000|
|  2|   0|   0|   1000|
|  3|   1|1000|   1000|
|  4|   0|   0|   2000|
|  5|   1|2000|   2000|
|  6|   1|4000|   4000|
|  7|   1|3000|   3000|
+---+----+----+-------+
bottaio
  • 4,963
  • 3
  • 19
  • 43