How to find longest sequence of consecutive dates?

Question

I have a database of visit times stored as Unix timestamps, like this:

ID, time
1, 1493596800
1, 1493596900
1, 1493432800
2, 1493596800
2, 1493596850
2, 1493432800

I use Spark SQL and I need the longest sequence of consecutive dates for each ID, like this:

ID, longest_seq (days)
1, 2
2, 5
3, 1

I tried to adapt the answer to "Detect consecutive dates ranges using SQL" to my case, but I didn't manage to get what I expect.

 SELECT ID, MIN (d), MAX(d)
    FROM (
      SELECT ID, cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date) AS d, 
                ROW_NUMBER() OVER(
         PARTITION BY ID ORDER BY cast(from_utc_timestamp(cast(time as timestamp), 'CEST') 
                                                           as date)) rn
      FROM purchase
      where ID is not null
      GROUP BY ID, cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date) 
    )
    GROUP BY ID, rn
    ORDER BY ID

If someone has a clue about how to fix this query, or what's wrong with it, I would appreciate the help. Thanks.

[EDIT] A more explicit input /output

ID, time
1, 1
1, 2
1, 3
2, 1
2, 3
2, 4
2, 5
2, 10
2, 11
3, 1
3, 4
3, 9
3, 11

The result would be :

ID, MaxSeq (in days)
1,3
2,3
3,1

All the visits are stored as timestamps, but I need consecutive days, so several visits on the same day count only once for that day.

score 7 · Answer 1 · answered Jun 22 '17 at 01:01

My answer below is adapted from https://dzone.com/articles/how-to-find-the-longest-consecutive-series-of-even for use in Spark SQL. You'll have to wrap the SQL queries with:

spark.sql("""
SQL_QUERY
""")

So, for the first query:

CREATE TABLE intermediate_1 AS
SELECT 
    id,
    time,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS rn,
    time - ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS grp
FROM purchase

This will give you:

id, time, rn, grp
1,  1,    1,  0
1,  2,    2,  0
1,  3,    3,  0
2,  1,    1,  0
2,  3,    2,  1
2,  4,    3,  1
2,  5,    4,  1
2,  10,   5,  5
2,  11,   6,  5
3,  1,    1,  0
3,  4,    2,  2
3,  9,    3,  6
3,  11,   4,  7

We can see that rows with consecutive time values share the same grp value. Next, we use GROUP BY and COUNT to get the length of each consecutive run.

CREATE TABLE intermediate_2 AS
SELECT 
    id,
    grp,
    COUNT(*) AS num_consecutive
FROM intermediate_1
GROUP BY id, grp

This will return:

id, grp, num_consecutive
1,  0,   3
2,  0,   1
2,  1,   3
2,  5,   2
3,  0,   1
3,  2,   1
3,  6,   1
3,  7,   1
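The two steps so far can be sketched outside Spark as well. The snippet below is a minimal plain-Python illustration (not Spark code) of the same idea: number each id's rows in time order, subtract that row number from time, and count the rows per (id, grp). The function name run_lengths is made up for this example.

```python
from collections import Counter

# Sample rows (id, time) from the question's edited example.
rows = [(1, 1), (1, 2), (1, 3),
        (2, 1), (2, 3), (2, 4), (2, 5), (2, 10), (2, 11),
        (3, 1), (3, 4), (3, 9), (3, 11)]


def run_lengths(rows):
    """Return {(id, grp): count}, mirroring intermediate_1 and intermediate_2."""
    times_by_id = {}
    for id_, time in rows:
        times_by_id.setdefault(id_, []).append(time)

    counts = Counter()
    for id_, times in times_by_id.items():
        # enumerate(..., start=1) plays the role of ROW_NUMBER() per id.
        for rn, time in enumerate(sorted(times), start=1):
            grp = time - rn  # constant while time increases by exactly 1
            counts[(id_, grp)] += 1
    return dict(counts)


lengths = run_lengths(rows)
print(lengths)
```

Each key is an (id, grp) pair and each value is the length of that run, so the dictionary should line up with the num_consecutive column above.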

Now we just use MAX and GROUP BY to get the length of the longest consecutive run for each id.

CREATE TABLE final AS
SELECT 
    id,
    MAX(num_consecutive) as max_consecutive
FROM intermediate_2
GROUP BY id

Which will give you:

id, max_consecutive
1,  3
2,  3
3,  1
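Note that the queries above work on the simplified input, where time is already a day number and each day appears once per id. With raw epoch timestamps, the visits would first have to be reduced to one row per id and day, otherwise two visits on the same day break the run. The following plain-Python sketch (not Spark code) shows that preprocessing followed by the same grp trick. It assumes days are cut at UTC midnight, which may not match the time zone used in the question, and the last sample row is a hypothetical visit added only to produce a run longer than one day.

```python
SECONDS_PER_DAY = 86_400


def longest_streak(visits):
    """visits: iterable of (id, epoch_seconds).

    Returns {id: length of the longest run of consecutive UTC days}.
    """
    days_by_id = {}
    for id_, ts in visits:
        # Integer division gives the UTC day number; the set removes
        # repeated visits on the same day.
        days_by_id.setdefault(id_, set()).add(ts // SECONDS_PER_DAY)

    result = {}
    for id_, days in days_by_id.items():
        runs = {}
        for rn, day in enumerate(sorted(days), start=1):
            grp = day - rn  # same value for every day in one run
            runs[grp] = runs.get(grp, 0) + 1
        result[id_] = max(runs.values())
    return result


visits = [
    (1, 1493596800), (1, 1493596900), (1, 1493432800),
    (2, 1493596800), (2, 1493596850), (2, 1493432800),
    (1, 1493510400),  # hypothetical extra visit, not in the question
]
streaks = longest_streak(visits)
print(streaks)
```

In Spark SQL the equivalent preprocessing would be a SELECT DISTINCT over the id and the date derived from the timestamp, before applying ROW_NUMBER().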

Hope this helps!

Some of the approaches you mentioned were really helpful in solving my problem. — Balaji Reddy, Apr 07 '19 at 17:26

Jacek Laskowski · Accepted Answer · 2018-11-21T20:31:39.793

That's the case for my beloved window aggregate functions!

I think the following example could help you out (at least to get started).

The following is the dataset I use. I translated your time values (epoch seconds stored as longs) into plain day numbers, to avoid messing around with timestamps in Spark SQL, which could make the solution harder to follow.

In the visits dataset below, the time column is the day number, so values that differ by 1 represent consecutive days.

scala> visits.show
+---+----+
| ID|time|
+---+----+
|  1|   1|
|  1|   1|
|  1|   2|
|  1|   3|
|  1|   3|
|  1|   3|
|  2|   1|
|  3|   1|
|  3|   2|
|  3|   2|
+---+----+

Let's define the window specification to group id rows together.

import org.apache.spark.sql.expressions.Window
val idsSortedByTime = Window.
  partitionBy("id").
  orderBy("time")

With that you rank the rows and count rows with the same rank.

val answer = visits.
  select($"id", $"time", rank over idsSortedByTime as "rank").
  groupBy("id", "time", "rank").
  agg(count("*") as "count")

scala> answer.show
+---+----+----+-----+
| id|time|rank|count|
+---+----+----+-----+
|  1|   1|   1|    2|
|  1|   2|   3|    1|
|  1|   3|   4|    3|
|  3|   1|   1|    1|
|  3|   2|   2|    2|
|  2|   1|   1|    1|
+---+----+----+-----+

That appears to be (very close to?) a solution. You seem done!
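The ranked rows still need one more step to yield the longest run. One possible way to finish, which is my assumption and not something stated in the answer, is to use a dense rank instead of rank: rank() leaves gaps after ties (1, 1, 3, 4 for id 1), so time minus rank is not constant within a run, whereas time minus dense rank is. The plain-Python sketch below (not Spark code) illustrates this on the dataset above; the function name is made up for the example.

```python
def longest_run_with_duplicates(rows):
    """rows: iterable of (id, day_number), possibly with repeated days.

    Returns {id: number of distinct days in the longest consecutive run}.
    """
    days_by_id = {}
    for id_, day in rows:
        days_by_id.setdefault(id_, []).append(day)

    result = {}
    for id_, days in days_by_id.items():
        # Dense rank: ties share a rank and the next rank has no gap.
        dense_rank = {day: i for i, day in enumerate(sorted(set(days)), start=1)}
        groups = {}
        for day in days:
            # Duplicate days land in the same group and are stored once.
            groups.setdefault(day - dense_rank[day], set()).add(day)
        result[id_] = max(len(distinct_days) for distinct_days in groups.values())
    return result


visits = [(1, 1), (1, 1), (1, 2), (1, 3), (1, 3), (1, 3),
          (2, 1),
          (3, 1), (3, 2), (3, 2)]
longest = longest_run_with_duplicates(visits)
print(longest)
```

In Spark the corresponding building block would be dense_rank() from org.apache.spark.sql.functions over the same window, followed by a distinct count of time per (id, time - dense_rank) and a max per id.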

score 0 · Answer 3 · answered Nov 22 '18 at 06:22

Using spark.sql with CTEs as intermediate tables:

scala> val df = Seq((1, 1),(1, 2),(1, 3),(2, 1),(2, 3),(2, 4),(2, 5),(2, 10),(2, 11),(3, 1),(3, 4),(3, 9),(3, 11)).toDF("id","time")
df: org.apache.spark.sql.DataFrame = [id: int, time: int]

scala> df.createOrReplaceTempView("tb1")

scala> spark.sql("""
     |   with tb2 as (
     |     select id, time,
     |            time - row_number() over (partition by id order by time) rw1
     |     from tb1
     |   ),
     |   tb3 as (
     |     select id, count(rw1) rw2
     |     from tb2
     |     group by id, rw1
     |   )
     |   select id, rw2
     |   from tb3
     |   where (id, rw2) in (select id, max(rw2) from tb3 group by id)
     |   group by id, rw2
     | """).show(false)
+---+---+
|id |rw2|
+---+---+
|1  |3  |
|3  |1  |
|2  |3  |
+---+---+
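The WHERE (id, rw2) IN (...) filter followed by GROUP BY id, rw2 keeps, for each id, only the largest run length, so it should return the same rows as a plain MAX(rw2) grouped by id. The sketch below checks that simplified form. It is an assumption-laden stand-in: it runs the query through Python's built-in sqlite3 module rather than Spark SQL, and it needs an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

rows = [(1, 1), (1, 2), (1, 3),
        (2, 1), (2, 3), (2, 4), (2, 5), (2, 10), (2, 11),
        (3, 1), (3, 4), (3, 9), (3, 11)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb1 (id INTEGER, time INTEGER)")
conn.executemany("INSERT INTO tb1 VALUES (?, ?)", rows)

# Same CTEs as above; only the final SELECT is simplified to MAX per id.
query = """
WITH tb2 AS (
    SELECT id, time,
           time - ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS rw1
    FROM tb1
),
tb3 AS (
    SELECT id, COUNT(rw1) AS rw2
    FROM tb2
    GROUP BY id, rw1
)
SELECT id, MAX(rw2) AS rw2
FROM tb3
GROUP BY id
ORDER BY id
"""
result = conn.execute(query).fetchall()
conn.close()
print(result)
```

The ORDER BY id is only there to make the row order deterministic; the Spark output above comes back in whatever order the partitions produce.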



score 0 · Answer 4 · answered Jan 30 '23 at 22:34

Solution using DataFrame API:

import org.apache.spark.sql.expressions.Window  // needed for Window.partitionBy below
import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = Seq((1, 1),(1, 2),(1, 3),(2, 1),(2, 3),(2, 4),(2, 5),(2, 10),(2, 11),(3, 1),(3, 4),(3, 9),(3, 11)).toDF("ID","time")

df1.show(false)
df1.printSchema()

val w = Window.partitionBy("ID").orderBy("time")
val df2 = df1.withColumn("rank", col("time") - row_number().over(w))
  .groupBy("ID", "rank")
  .agg(count("rank").alias("count"))
  .groupBy("ID")
  .agg(max("count").alias("time"))
  .orderBy("ID")

df2.show(false)

Console output:

+---+----+
|ID |time|
+---+----+
|1  |1   |
|1  |2   |
|1  |3   |
|2  |1   |
|2  |3   |
|2  |4   |
|2  |5   |
|2  |10  |
|2  |11  |
|3  |1   |
|3  |4   |
|3  |9   |
|3  |11  |
+---+----+

root
 |-- ID: integer (nullable = false)
 |-- time: integer (nullable = false)

+---+----+
|ID |time|
+---+----+
|1  |3   |
|2  |3   |
|3  |1   |
+---+----+
