
My dataset looks like this:

KEY |Event_Type | metric | Time 
001 |event1     | 10     | 2016-05-01 10:50:51
002 |event2     | 100    | 2016-05-01 10:50:53
001 |event3     | 20     | 2016-05-01 10:50:55
001 |event1     | 15     | 2016-05-01 10:51:50
003 |event1     | 13     | 2016-05-01 10:55:30
001 |event2     | 12     | 2016-05-01 10:57:00
001 |event3     | 11     | 2016-05-01 11:00:01

I want to get all the keys that satisfy this:

"SUM of metric for a specific event" > threshold during a 5-minute window.

This appears to me to be a perfect candidate for using sliding window functions.

How can I do this with Spark SQL?

Thank you.

Nabil

2 Answers


Spark >= 2.0

You can use window (not to be confused with window functions). Depending on the variant, it assigns a timestamp to one or more, potentially overlapping, buckets:

df.groupBy($"KEY", window($"time", "5 minutes")).sum("metric")

// +---+---------------------------------------------+-----------+
// |KEY|window                                       |sum(metric)|
// +---+---------------------------------------------+-----------+
// |001|[2016-05-01 10:50:00.0,2016-05-01 10:55:00.0]|45         |
// |001|[2016-05-01 10:55:00.0,2016-05-01 11:00:00.0]|12         |
// |003|[2016-05-01 10:55:00.0,2016-05-01 11:00:00.0]|13         |
// |001|[2016-05-01 11:00:00.0,2016-05-01 11:05:00.0]|11         |
// |002|[2016-05-01 10:50:00.0,2016-05-01 10:55:00.0]|100        |
// +---+---------------------------------------------+-----------+
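
To actually answer the question you still need to apply the threshold, and if you want overlapping 5-minute buckets you can pass a slide duration as the third argument to window. A minimal sketch (the threshold of 25 and the 1-minute slide are assumptions, adjust to your needs):

import org.apache.spark.sql.functions.{sum, window}

val threshold = 25  // hypothetical threshold

df
  .groupBy($"KEY", window($"Time", "5 minutes", "1 minute"))  // 5-minute windows sliding every minute
  .agg(sum($"metric").alias("total"))
  .where($"total" > threshold)
  .select($"KEY")
  .distinct()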

Spark < 2.0

Let's start with example data:

import spark.implicits._  // import sqlContext.implicits._ in Spark < 2.0

val df = Seq(
  ("001", "event1", 10, "2016-05-01 10:50:51"),
  ("002", "event2", 100, "2016-05-01 10:50:53"),
  ("001", "event3", 20, "2016-05-01 10:50:55"),
  ("001", "event1", 15, "2016-05-01 10:51:50"),
  ("003", "event1", 13, "2016-05-01 10:55:30"),
  ("001", "event2", 12, "2016-05-01 10:57:00"),
  ("001", "event3", 11, "2016-05-01 11:00:01")
).toDF("KEY", "Event_Type", "metric", "Time")

I assume that an event is identified by KEY. If this is not the case, you can adjust the GROUP BY / PARTITION BY clauses according to your requirements.

If you're interested in an aggregation with a static window independent of the data, convert the timestamps to a numeric data type and round:

import org.apache.spark.sql.functions.{round, sum}

// Cast the string to a timestamp, then to Unix epoch seconds
val ts = $"Time".cast("timestamp").cast("long")

// Round to a 300-second interval
// In Spark >= 3.1 you can use timestamp_seconds instead of cast("timestamp")
val interval = (round(ts / 300L) * 300.0).cast("timestamp").alias("interval")

df.groupBy($"KEY", interval).sum("metric")

// +---+---------------------+-----------+
// |KEY|interval             |sum(metric)|
// +---+---------------------+-----------+
// |001|2016-05-01 11:00:00.0|11         |
// |001|2016-05-01 10:55:00.0|12         |
// |001|2016-05-01 10:50:00.0|45         |
// |003|2016-05-01 10:55:00.0|13         |
// |002|2016-05-01 10:50:00.0|100        |
// +---+---------------------+-----------+
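
Since the question asks about Spark SQL specifically, the same static-window aggregation can also be expressed as a plain SQL query against a registered table. A rough sketch, assuming the table name events and a hypothetical threshold of 25:

// registerTempTable in Spark < 2.0; use createOrReplaceTempView in Spark >= 2.0
df.registerTempTable("events")

sqlContext.sql("""
  SELECT KEY,
         CAST(ROUND(CAST(CAST(Time AS timestamp) AS bigint) / 300) * 300 AS timestamp) AS interval,
         SUM(metric) AS total
  FROM events
  GROUP BY KEY, CAST(ROUND(CAST(CAST(Time AS timestamp) AS bigint) / 300) * 300 AS timestamp)
  HAVING SUM(metric) > 25
""")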

If you're interested in a window relative to the current row, use window functions:

import org.apache.spark.sql.expressions.Window

// Partition by KEY
// Order by timestamp 
// Consider a window of -150 seconds to +150 seconds relative to the current row
val w = Window.partitionBy($"KEY").orderBy("ts").rangeBetween(-150, 150)
df.withColumn("ts", ts).withColumn("window_sum", sum($"metric").over(w))

// +---+----------+------+-------------------+----------+----------+
// |KEY|Event_Type|metric|Time               |ts        |window_sum|
// +---+----------+------+-------------------+----------+----------+
// |003|event1    |13    |2016-05-01 10:55:30|1462092930|13        |
// |001|event1    |10    |2016-05-01 10:50:51|1462092651|45        |
// |001|event3    |20    |2016-05-01 10:50:55|1462092655|45        |
// |001|event1    |15    |2016-05-01 10:51:50|1462092710|45        |
// |001|event2    |12    |2016-05-01 10:57:00|1462093020|12        |
// |001|event3    |11    |2016-05-01 11:00:01|1462093201|11        |
// |002|event2    |100   |2016-05-01 10:50:53|1462092653|100       |
// +---+----------+------+-------------------+----------+----------+
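
To get back to the keys over the threshold with this approach, filter on the computed window_sum (the threshold of 25 is again just an assumed example):

val threshold = 25  // hypothetical threshold

df.withColumn("ts", ts)
  .withColumn("window_sum", sum($"metric").over(w))
  .where($"window_sum" > threshold)
  .select($"KEY")
  .distinct()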

For performance reasons this approach is useful only if the data can be partitioned into multiple separate groups. In Spark < 2.0.0 you'll also need a HiveContext to make it work.

zero323
  • Hi, I am using Java. How can I do the same operations in Java with Spark 2.1.0? – sathya Aug 08 '17 at 09:57
  • @sathiyarajan Should be pretty much the same, excluding minor syntax differences. – zero323 Aug 09 '17 at 11:32
  • Hi, what is the approach for grouping bi-monthly, quarterly, or 6-monthly? – ZK Zhao May 01 '18 at 08:36
  • https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.window.html#pyspark.sql.functions.window – that's the PySpark version of the window function – Yanis Jan 27 '23 at 10:48

For a static boundary you can do the following:

1) Transform (map, mapPartitions, etc.) the Time value to the form YYYY-MM-DD-hh-mm, where mm is rolled up to the 5-minute level, e.g. 01, 02, 03, 04, 05 become 05; 16, 17, 18, 19, 20 become 20.

2) Perform a groupBy or reduceByKey on event_type and time, and compute your aggregation (sum) on the metric.

3) Apply a filter transformation to keep only the rows where the aggregated metric exceeds your threshold (> 5 in this example).

You can write the above with Spark RDDs or DataFrames (SQL) in almost the same way; a rough DataFrame sketch follows.
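
A sketch of steps 1-3 with the DataFrame API (the 5-minute roll-up via ceil, the inclusion of KEY in the grouping, and the threshold of 25 are assumptions on my part):

import org.apache.spark.sql.functions.{ceil, from_unixtime, sum, unix_timestamp}

val threshold = 25  // hypothetical threshold

df
  .withColumn("ts", unix_timestamp($"Time"))                                          // string -> epoch seconds
  .withColumn("bucket", from_unixtime(ceil($"ts" / 300) * 300, "yyyy-MM-dd-HH-mm"))   // roll minutes up to the 5-minute boundary
  .groupBy($"KEY", $"Event_Type", $"bucket")
  .agg(sum($"metric").alias("total"))
  .filter($"total" > threshold)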

For the other type of boundary, where the windows are 00-05, 01-06, 02-07, you should look into the concept of a sliding window. If your data-ingestion use case fits a streaming pattern, the Spark Streaming API will be a perfect fit; otherwise you can find a custom solution like this one: Apache Spark - Dealing with Sliding Windows on Temporal RDDs
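
If you do go the Spark Streaming route, the DStream API has reduceByKeyAndWindow for exactly this shape of problem. A bare-bones sketch (the socket source, 10-second batch interval, record format, and threshold of 25 are all assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("SlidingMetricSum")
val ssc = new StreamingContext(conf, Seconds(10))                        // 10-second batches (assumption)

// Hypothetical source emitting lines like "001,event1,10,2016-05-01 10:50:51"
val lines = ssc.socketTextStream("localhost", 9999)
val keyed = lines.map(_.split(",")).map(p => (p(0), p(2).trim.toInt))    // (KEY, metric)

// 5-minute window sliding every minute, then keep keys above the threshold
keyed
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(300), Seconds(60))
  .filter { case (_, total) => total > 25 }
  .print()

ssc.start()
ssc.awaitTermination()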

nir