I have some data that is ordered by dateTime, and I want to add up the values based on another column(1 or 0). HOWEVER, I need it done such that it only sums the values up to 5 seconds after. How do I do this?
ex Table
|ID |GPS_TimeStamp |overG|
---------------------------------
|aa |2019-08-01 00:18:05.1 |1 |
|aa |2019-08-01 00:18:06.3 |0 |
|aa |2019-08-01 00:18:08.4 |1 |
|aa |2019-08-01 00:18:10.0 |1 |
|aa |2019-08-01 00:18:11.1 |0 |
|aa |2019-08-01 00:18:12.2 |0 |
|aa |2019-08-01 00:18:13.8 |1 |
|aa |2019-08-01 00:18:16.1 |0 |
---------------------------------
my pseudo code that doesnt work is as below
myData = myData.withColumn("overG-sum5Seconds",
sum(col("overG")).over(Window.partitionBy(
"GPS_TimeStamp"
).orderBy("GPS_TimeStamp").rangeBetween(0, Window.currentRow+timedelta(seconds=5))
)
for a result that looks like
|ID |GPS_TimeStamp |overG|overG-sum5Seconds|
---------------------------------------------------
|aa |2019-08-01 00:18:05.1 |1 |3 |
|aa |2019-08-01 00:18:06.3 |0 |2 |
|aa |2019-08-01 00:18:08.4 |1 |3 |
|aa |2019-08-01 00:18:10.0 |1 |2 |
|aa |2019-08-01 00:18:11.1 |0 |1 |
|aa |2019-08-01 00:18:12.2 |0 |1 |
|aa |2019-08-01 00:18:13.8 |1 |1 |
|aa |2019-08-01 00:18:16.1 |0 |0 |
---------------------------------------------------
I cannot use lag or lead because not every second is in the list. So it has to be a condition based on GPS_TimeStamp.
Thanks in advance