
This is the data I have.

I have time series data. I want to split it into intervals such that whenever a 1 appears in the detector column, that row ends one interval and starts the next. I can do this with groupBy, but groupBy is giving me performance problems, so I want an alternative method. At the same time, I also want to detect intervals where the time difference between two consecutive rows is greater than or equal to 15.

For simplicity, consider the example below:

time | detector
5    | 0
10   | 0
15   | 0
20   | 0
25   | 1
35   | 0
40   | 0
56   | 0
57   | 0
55   | 0
60   | 1
65   | 0
70   | 0
75   | 0
80   | 1
85   | 0

The output I want is:

interval 
[5,25]   
[25,60]  
[40,56]  
[60,80]  
[80,85]
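
One way to get the detector-based intervals without any groupBy is to keep only the boundary rows (rows where detector is 1, plus the first and last row of each partition) and pair each boundary time with the next one via lead(). This is a minimal sketch, not the code from the question; it assumes the same imei/time/detector columns used in the update below, and whether it is actually faster than the running-total groupBy depends on the data.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy(col("imei")).orderBy(col("time").asc)

// Keep only the boundary rows: detector hits plus the first and last
// row of each partition.
val boundaries = df
  .withColumn("rn", row_number().over(w))
  .withColumn("maxRn", max(col("rn")).over(Window.partitionBy(col("imei"))))
  .filter(col("detector") === 1 || col("rn") === 1 || col("rn") === col("maxRn"))

// Pair each boundary with the next one; no aggregation involved.
val intervals = boundaries
  .withColumn("next_time", lead(col("time"), 1).over(w))
  .filter(col("next_time").isNotNull)
  .select(col("imei"), struct(col("time").as("start"), col("next_time").as("end")).as("interval"))

For the sample above this yields [5,25], [25,60], [60,80] and [80,85]; the [40,56] interval comes from the 15-minute gap check, which is handled separately below.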

Update 1:

val wAll = Window.partitionBy(col("imei")).orderBy(col("time").asc)

// The running sum of the detector flags assigns each interval its own
// group id; lead() pulls the next row's time so that an interval can end
// at the row that starts the next one.
val test = df
  .withColumn("lead_time", lead("time", 1, null).over(wAll))
  .withColumn("runningTotal", sum("detector").over(wAll))
  .groupBy("runningTotal")
  .agg(struct(min("time"), max("lead_time")).as("interval"))

This computes the intervals where the gap between consecutive data points is greater than or equal to 15 minutes:

val unreachable_df = df
  .withColumn("lag_time", lag("time", 1, null).over(wAll))
  .withColumn("diff_time", abs((col("time") - col("lag_time")) / 60D)) // seconds -> minutes
  .withColumn("unreachable", when(col("diff_time") >= 15.0, 0).otherwise(1))
  .drop("diff_time")
  .drop("lag_time")
  // A 0 in "unreachable" starts a new group, so the running sum again
  // assigns one group id per interval.
  .withColumn("runningTotal", sum("unreachable").over(wAll))
  .groupBy("runningTotal")
  .agg(struct(min("time"), max("time")).as("interval"))
  .withColumn("diff_interval", abs(unix_timestamp(col("interval.col1")) - unix_timestamp(col("interval.col2"))))
  .filter(col("diff_interval") > 0)
  .drop("diff_interval")
  .withColumn("type", lit("Unreachable"))
  .drop("runningTotal")

Then I merged the two DataFrames to get the result above:

val merged_df = test.union(unreachable_df).sort(col("interval.col1"))
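
The gap-based intervals do not actually need a running total or a groupBy at all: two consecutive rows that are at least 15 minutes apart are themselves the endpoints of the interval, so a single lag() produces it directly. A minimal sketch under the same assumptions as above (time in seconds, hence the division by 60, and the same wAll window):

// Every pair of consecutive rows at least 15 minutes apart is itself an
// "unreachable" interval, so lag() alone gives both endpoints.
val unreachable = df
  .withColumn("prev_time", lag(col("time"), 1).over(wAll))
  .filter(col("prev_time").isNotNull && (col("time") - col("prev_time")) / 60D >= 15.0)
  .select(col("imei"), struct(col("prev_time").as("start"), col("time").as("end")).as("interval"))

// Union with the boundary-pair intervals from the sketch further up;
// only window functions and a sort remain, with no groupBy anywhere.
val merged = intervals.union(unreachable).sort(col("interval.start"))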
  • Can you please supply the code with the performance issue so we can help you work that out? – eliasah Aug 28 '18 at 06:52
  • Why don't you post the whole schema of your input df, and the test data? – tauitdnmd Aug 28 '18 at 07:12
  • Your window function doesn't have a partition column, so all of your data will go into one partition. I'd start by fixing that. – eliasah Aug 28 '18 at 07:12
  • https://stackoverflow.com/a/41316277/3415409 – eliasah Aug 28 '18 at 07:14
  • @eliasah I have posted just the sample data, whereas the original data is GPS data consisting of different ids, so I have partitioned on the basis of the id column. – experiment Aug 28 '18 at 07:15
  • The issue isn't in the groupBy itself. You'll need to find something to partition on. Yet again, the need for an MVCE if you want some help. Think about SoC (separation of concerns). – eliasah Aug 28 '18 at 07:18
  • @eliasah Please suggest another way to solve the above problem without using groupBy; I will figure my way around my data. – experiment Aug 28 '18 at 08:28
  • I suggested a solution, and like I said, it's not the groupBy that is causing the problem. I can't give an exact answer to this because it's broad and missing an MVCE, like I also said before. – eliasah Aug 28 '18 at 08:29
  • @eliasah Sir, as I told you in the comments above, it's GPS data of various cars: every car has its unique id and its GPS data, so I have partitioned on the basis of car id. But it is still not giving good performance. – experiment Aug 28 '18 at 08:33
