1

I have some data that is ordered by dateTime, and I want to add up the values based on another column(1 or 0). HOWEVER, I need it done such that it only sums the values up to 5 seconds after. How do I do this?

ex Table

|ID  |GPS_TimeStamp         |overG|
---------------------------------
|aa  |2019-08-01 00:18:05.1 |1    |
|aa  |2019-08-01 00:18:06.3 |0    |
|aa  |2019-08-01 00:18:08.4 |1    |
|aa  |2019-08-01 00:18:10.0 |1    |
|aa  |2019-08-01 00:18:11.1 |0    |
|aa  |2019-08-01 00:18:12.2 |0    |
|aa  |2019-08-01 00:18:13.8 |1    |
|aa  |2019-08-01 00:18:16.1 |0    |
---------------------------------

my pseudo code that doesnt work is as below

myData = myData.withColumn("overG-sum5Seconds", 
   sum(col("overG")).over(Window.partitionBy(
      "GPS_TimeStamp"
   ).orderBy("GPS_TimeStamp").rangeBetween(0, Window.currentRow+timedelta(seconds=5))
   )

for a result that looks like

|ID  |GPS_TimeStamp         |overG|overG-sum5Seconds|
---------------------------------------------------
|aa  |2019-08-01 00:18:05.1 |1    |3                |
|aa  |2019-08-01 00:18:06.3 |0    |2                |
|aa  |2019-08-01 00:18:08.4 |1    |3                |
|aa  |2019-08-01 00:18:10.0 |1    |2                |
|aa  |2019-08-01 00:18:11.1 |0    |1                |
|aa  |2019-08-01 00:18:12.2 |0    |1                |
|aa  |2019-08-01 00:18:13.8 |1    |1                |
|aa  |2019-08-01 00:18:16.1 |0    |0                |
---------------------------------------------------

I cannot use lag or lead because not every second is in the list. So it has to be a condition based on GPS_TimeStamp.

Thanks in advance

Ranald Fong
  • 401
  • 3
  • 12

2 Answers2

1

Found my answer after visiting a few sites.

https://www.linkedin.com/pulse/time-series-moving-average-apache-pyspark-laurent-weichberger

Turns out I wanted a sliding avg/sum

myData = myData.withColumn("unix", (unix_timestamp("GPS_TimeStamp"))+ expr("substr(GPS_TimeStamp,instr(GPS_TimeStamp, '.'))"))
w = (Window.partitionBy("id").orderBy(col("unix")).rangeBetween(0, 5))
myData = myData.withColumn('rolling_sum', sum("overG").over(w))
Ranald Fong
  • 401
  • 3
  • 12
0

Window Function Frame can solve your problem. Window Frames In short, all you have to do is the conditional cumulative sum you can refer this answer as well, How to get cumulative sum.

aksr
  • 322
  • 1
  • 2
  • 11
  • Can you please provide code? Those links all use ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, but I need something like rangeBetween(CURRENT ROW AND (CURRENT ROW)+5seconds) – Ranald Fong Oct 08 '19 at 00:32
  • Also, I'm looking at more of a sliding window sum not a rolling cumulative. – Ranald Fong Oct 08 '19 at 02:10