0

I have table storing events occurring to users as shown in http://sqlfiddle.com/#!15/2b559/2/0

event_id(integer)
user_id(integer)
event_type(integer)
timestamp(timestamp)

A sample of the data looks as follows:

+-----------+----------+-------------+----------------------------+
| event_id  | user_id  | event_type  |         timestamp          |
+-----------+----------+-------------+----------------------------+
|        1  |       1  |          1  | January, 01 2015 00:00:00  |
|        2  |       1  |          1  | January, 10 2015 00:00:00  |
|        3  |       1  |          1  | January, 20 2015 00:00:00  |
|        4  |       1  |          1  | January, 30 2015 00:00:00  |
|        5  |       1  |          1  | February, 10 2015 00:00:00 |
|        6  |       1  |          1  | February, 21 2015 00:00:00 |
|        7  |       1  |          1  | February, 22 2015 00:00:00 |
+-----------+----------+-------------+----------------------------+

I would like to get, for each event, the number of events of the same user and the same event_type that occurred within 30 days before the event.

It should look like the following:

+-----------+----------+-------------+-----------------------------+-------+
| event_id  | user_id  | event_type  |         timestamp           | count |
+-----------+----------+-------------+-----------------------------+-------+
|        1  |       1  |          1  | January, 01 2015 00:00:00   |     1 |
|        2  |       1  |          1  | January, 10 2015 00:00:00   |     2 |
|        3  |       1  |          1  | January, 20 2015 00:00:00   |     3 |
|        4  |       1  |          1  | January, 30 2015 00:00:00   |     4 |
|        5  |       1  |          1  | February, 10 2015 00:00:00  |     3 |
|        6  |       1  |          1  | February, 21 2015 00:00:00  |     3 |
|        7  |       1  |          1  | February, 22 2015 00:00:00  |     4 |
+-----------+----------+-------------+-----------------------------+-------+

The table contains millions of rows so I cannot go with a correlated subquery as suggested by @jpw in the answers below.

So far I managed to get the total number of events that occurred before with the same user_id and same event_id by using the following query:

SELECT event_id, user_id,event_type,"timestamp",
COUNT(event_type) OVER w
FROM events
WINDOW w AS (PARTITION BY user_id,event_type ORDER BY timestamp
ROWS UNBOUNDED PRECEDING);

With the following result:

+-----------+----------+-------------+-----------------------------+-------+
| event_id  | user_id  | event_type  |         timestamp           | count |
+-----------+----------+-------------+-----------------------------+-------+
|        1  |       1  |          1  | January, 01 2015 00:00:00   |     1 |
|        2  |       1  |          1  | January, 10 2015 00:00:00   |     2 |
|        3  |       1  |          1  | January, 20 2015 00:00:00   |     3 |
|        4  |       1  |          1  | January, 30 2015 00:00:00   |     4 |
|        5  |       1  |          1  | February, 10 2015 00:00:00  |     5 |
|        6  |       1  |          1  | February, 21 2015 00:00:00  |     6 |
|        7  |       1  |          1  | February, 22 2015 00:00:00  |     7 |
+-----------+----------+-------------+-----------------------------+-------+

Do you know if there a way to change the window frame specification or the COUNT function so only the number of events which occurred within x days is returned?

In a second time, I would like to exclude duplicate events, i.e. same event_type and same timestamp.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Julien Bourdon
  • 1,713
  • 17
  • 28
  • I just found a similar question here: http://stackoverflow.com/questions/8193688/postgresql-running-count-of-rows-for-a-query-by-minute. – Julien Bourdon Apr 05 '15 at 00:15

4 Answers4

2

I provided a more detailed answer plus fiddle under the duplicate question on dba.SE.

Basically:

CREATE INDEX events_fast_idx ON events (user_id, event_type, ts);

And either:

SELECT *
FROM   events e
    ,  LATERAL (
   SELECT count(*) AS ct
   FROM   events 
   WHERE  user_id    = e.user_id 
   AND    event_type = e.event_type
   AND    ts >= e.ts - interval '30 days'
   AND    ts <= e.ts
   ) ct
ORDER  BY event_id;

Or:

SELECT e.*, count(*) AS ct
FROM   events e
JOIN   events x USING (user_id, event_type)
WHERE  x.ts >= e.ts - interval '30 days'
AND    x.ts <= e.ts
GROUP  BY e.event_id
ORDER  BY e.event_id;
Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
1

Maybe you already know how to solve this using a subquery and are asking specifically for a solution using a window function and if so this answer might be invalid for that reason, but if you're interest is in any possible solution then it's easy to solve this using a correlated subquery, although I suspect performance might be bad:

select 
  event_id, user_id,event_type,"timestamp", 
  (
    select count(distinct timestamp) 
    from events 
    where timestamp >= e.timestamp - interval '30 days'
    and timestamp <= e.timestamp
    and user_id = e.user_id 
    and event_type = e.event_type
    group by event_type, user_id
  ) as "count"
FROM events e
order by event_id;

Sample SQL Fiddle

jpw
  • 44,361
  • 6
  • 66
  • 86
  • Thanks for your answer. Yeah I should have been more precise in the question. The table contains millions of rows so your answer, while correct, will not be scale well with the volume :) +1 though – Julien Bourdon Apr 05 '15 at 01:13
  • I suspected that might be the case. Going to think a bit more about it. – jpw Apr 05 '15 at 01:26
0

Tis is clumsy, but it works. The CTE will probably perform worse than @jpw 's counted correlated subquery.

WITH ding AS (
  SELECT user_id, event_type , ztimestamp
        , row_number() OVER( PARTITION BY user_id, event_type
                             ORDER BY ztimestamp) AS rnk
  FROM events
  )
SELECT d1.*
        , 1+ d1.rnk - d0.rnk AS diff
FROM ding d1
JOIN ding d0 USING (user_id,event_type)
WHERE d1.ztimestamp >= d0.ztimestamp
AND d1.ztimestamp < d0.ztimestamp + '30 days'::interval
AND NOT EXISTS (
        SELECT *
        FROM ding nx
        WHERE nx.user_id = d0.user_id
        AND nx.event_type = d0.event_type
        AND nx.ztimestamp < d0.ztimestamp
        AND nx.ztimestamp > d1.ztimestamp - '30 days'::interval
        )
        ;
wildplasser
  • 43,142
  • 8
  • 66
  • 109
  • Indeed, The execution plan does not look that good :) -> http://i.imgur.com/zaCHBiM.png :) Took 693 ms to execute on a 1000 rows generated sample. @jpw's took 71 ms with the same sample. The one I will propose took 49 ms. – Julien Bourdon Apr 05 '15 at 13:00
0

I found a request that works:

SELECT toto.event_id,toto.user_id,toto.event_type,toto.lv as time,COUNT(*)
FROM(
    SELECT e.event_id, e.user_id,e.event_type,"timestamp",
    last_value("timestamp") OVER w as lv,
    unnest(array_agg(e."timestamp") OVER w) as agg
    FROM events e
    WINDOW w AS (PARTITION BY e.user_id,e.event_type ORDER BY e."timestamp"
    ROWS UNBOUNDED PRECEDING)) AS toto
WHERE toto.agg >= toto.lv - interval '30 days'
GROUP by event_id,user_id,event_type,lv;

On my dev machine with a sample of 1000 rows, it takes 49 ms to execute. With a 10000 rows sample, it takes 8277ms while @jpw's query takes 6720 ms, using an index on timestamp. With a sample of 50000 rows, it takes more than 100 s for both queries so I did not test :)

SQL Fiddle

Julien Bourdon
  • 1,713
  • 17
  • 28