How to access previous row calculated value in the current row in Spark data frame

Question

I have a DataFrame with client ID, timestamp, and "all" columns, and I need to produce a column named "Required".

The "Required" value is the difference between the previous row's value and the current row's "all" value, combined with the list elements of the current row.

However, the result for the current row should then be used as the previous-row value when calculating the difference for the next row. How can I get the previous row's calculated value in the next row using Spark Scala? The expected output is shown below, followed by the UDF I tried.

+---------+-------------------+--------------------+-------------------------------------+
|CLIENT_ID|timestamp          |all                 |Required                             |
+---------+-------------------+--------------------+-------------------------------------+
|69415092 |2002-03-15 00:00:00|[[06,718], [07,718]]|[[06,718], [07,718]]                 |
|69415092 |2002-03-19 00:00:00|[[10,718]]          |[[06,718], [07,718],[10,718]]        |
|69415092 |2002-03-22 00:00:00|[[06,223],[12,718]] |[[07,718],[10,718],[12,718],[06,223]]|
|69415092 |2002-11-16 00:00:00|[[12,386]]          |[[07,718],[10,718],[06,223],[12,386]]|
+---------+-------------------+--------------------+-------------------------------------+

However, the calculated value is not written back to the existing column, so the next row does not see it.

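    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, lag, udf}

    val window = Window.partitionBy("CLP_CLIENT_ID").orderBy("timestamp")

    // s1: the current row's "all"; s2: the previous row's "all"
    // (Array("0") is the sentinel used when there is no previous row)
    def fun1(s1: Seq[String], s2: Seq[String]): Seq[String] = {
      val un = s2.diff(s1)
      if (un.contains("0") || un.isEmpty) s1
      else un ++ s1
    }
    val funUdf = udf(fun1 _)

    // lag returns the previous row's original "all" value, not its computed "Required"
    val uniondf = df3
      .withColumn("Required", funUdf(col("all"), lag("all", 1, Array("0")).over(window)))
      .select("CLP_CLIENT_ID", "timestamp", "all", "Required")
    uniondf.show(false)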
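
For reference, the carry-forward that the "Required" column describes behaves like a running fold over each client's rows rather than a one-row lookback. The following is a minimal sketch in plain Scala (no Spark), not a tested Spark solution. It rests on assumptions read off the expected-output table: each element of "all" is a pair whose first field acts as a key (CLP_PHONE_TYPE_S_CD in the posted schema), a newer pair replaces an older one with the same key, and the order of elements inside "Required" does not matter. The names `RunningRequired`, `Entry`, `merge`, and `required` are hypothetical.

```scala
object RunningRequired {
  // Hypothetical stand-in for the struct element: (phone type code, area code).
  type Entry = (String, String)

  // Merge one row's "all" into the previously computed "Required" value.
  // Assumption: entries are keyed by their first field, and newer entries win.
  def merge(prev: Seq[Entry], current: Seq[Entry]): Seq[Entry] = {
    val updatedKeys = current.map(_._1).toSet
    prev.filterNot(e => updatedKeys.contains(e._1)) ++ current
  }

  // rows: one client's "all" values, already sorted by timestamp.
  // scanLeft feeds each computed result into the next step;
  // drop(1) discards the empty seed value.
  def required(rows: Seq[Seq[Entry]]): Seq[Seq[Entry]] =
    rows.scanLeft(Seq.empty[Entry])(merge).drop(1)
}
```

Applied to the four sample rows, this yields the same sets of pairs as the "Required" column above. In Spark, one possible way to run such a fold per client is `groupByKey` followed by `flatMapGroups` on a typed Dataset, sorting each group's rows by timestamp first; that holds every row of one client in memory at once, so it only suits clients with a modest number of rows.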

You might find [How do I format my posts using Markdown or HTML?](https://stackoverflow.com/help/formatting) useful. — Alper t. Turker, Jan 24 '18 at 21:07
And to solve this with Spark SQL, you have to [define `UserDefinedAggregateFunction`](https://stackoverflow.com/q/32100973/8371915) but it will be very slow, when used with `Arrays`. — Alper t. Turker, Jan 24 '18 at 21:22
The schema of the dataframe:

root
 |-- CLP_CLIENT_ID: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- all: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- CLP_PHONE_TYPE_S_CD: string (nullable = true)
 |    |    |-- CLP_AREA_CD: string (nullable = true)

— Janani Eshwaran, Jan 26 '18 at 15:13

0 Answers