
I have a table like this.

ID (integer)
event_name (varchar(20))
event_date (timestamp)

Some sample data is given below.

ID         event_date                          event_name
101        2013-04-24 18:33:37.694818          event_A
102        2013-04-24 20:34:37.000000          event_B
103        2013-04-24 20:40:37.000000          event_A
104        2013-04-25 01:00:00.694818          event_A
105        2013-04-25 12:00:15.694818          event_A
106        2013-04-26 00:56:10.800000          event_A
107        2013-04-27 12:00:15.694818          event_A
108        2013-04-27 12:00:15.694818          event_B

I need to generate a window-based report, where a window represents a group of days. E.g. if I choose a window size of 2, I need to show the total count of each event over two successive days, i.e. the same day and the previous day. If I choose a window size of 3, I need to generate the count of each event over three successive days.

So if a 2-day window is selected, the result should be something like below.

Date                                       Count_eventA                 Count_eventB
2013-04-27 (this counts sum of 27th, 26th)       2                           1 
2013-04-26 (this counts sum of 26th, 25th)       3                           0
2013-04-25 (this counts sum of 25th, 24th)       4                           1
2013-04-24 (this counts sum of 24th      )       2                           1

I have read about window functions in Postgres. Can someone guide me on how to write a SQL query for this report?

Anant

1 Answer


You want to use the count aggregate as a window function, e.g. count(id) OVER (PARTITION BY event_name ORDER BY event_date ROWS 3 PRECEDING) ... but it's greatly complicated by the nature of your data. You're storing timestamps, not just dates, and you want to group by day, not by a number of previous events. To top it all off, you want to cross-tabulate the results.

If PostgreSQL supported RANGE frames in window functions this would be considerably simpler. As it is, you have to do it the hard way.
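(Note for later readers: PostgreSQL 11 and newer *do* support RANGE frames with value offsets, which makes the gap-filling step below unnecessary. A sketch of what the per-day rolling count could look like on those versions; untested against the original data, so treat it as an outline rather than a drop-in replacement:

    SELECT event_name,
           event_day,
           sum(event_day_count) OVER (
               PARTITION BY event_name
               ORDER BY event_day
               RANGE BETWEEN INTERVAL '1 day' PRECEDING AND CURRENT ROW
           ) AS two_day_count
    FROM (
        SELECT event_name,
               date_trunc('day', event_date) AS event_day,
               count(id) AS event_day_count
        FROM Table1
        GROUP BY event_name, date_trunc('day', event_date)
    ) per_day;

Here the frame is defined by the *value* of event_day, so a missing day simply falls outside the one-day range instead of being pulled in as the "previous row". The rest of this answer assumes the older ROWS-only behaviour.)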

You first aggregate the events into per-day counts, then filter those through a window to get the per-event per-day lagged counts ... except that your event days aren't contiguous, and unfortunately PostgreSQL's window functions only support ROWS, not RANGE, so you have to join across a generated series of dates first.

WITH
/* First, get a listing of event counts by day */
event_days(event_name, event_day, event_day_count) AS (
        SELECT event_name, date_trunc('day', event_date), count(id)
        FROM Table1
        GROUP BY event_name, date_trunc('day', event_date)
        ORDER BY date_trunc('day', event_date), event_name
),
/* 
 * Then fill in zeros for any days within the range that didn't have any events.
 * If PostgreSQL supported RANGE windows, not just ROWS, we could get rid of this.
 */
event_days_contiguous(event_name, event_day, event_day_count) AS (
        SELECT event_names.event_name, gen_day, COALESCE(event_days.event_day_count,0)
        FROM generate_series( (SELECT min(event_day)::date FROM event_days), (SELECT max(event_day)::date FROM event_days), INTERVAL '1' DAY ) gen_day
        CROSS JOIN (SELECT DISTINCT event_name FROM event_days) event_names(event_name)
        LEFT OUTER JOIN event_days ON (gen_day = event_days.event_day AND event_names.event_name = event_days.event_name)
),
/*
 * Get the lagged counts by using the sum() function over a row window...
 */
lagged_days(event_name, event_day_first, event_day_last, event_days_count) AS (
        SELECT event_name, event_day, first_value(event_day) OVER w, sum(event_day_count) OVER w
        FROM event_days_contiguous
        WINDOW w AS (PARTITION BY event_name ORDER BY event_day ROWS 1 PRECEDING)
)
/* Now do a manual pivot. For arbitrary column counts use an external tool
 * or check out the 'crosstab' function in the 'tablefunc' contrib module 
 */
SELECT d1.event_day_first, d1.event_days_count AS "Event_A", d2.event_days_count AS "Event_B"
FROM lagged_days d1
INNER JOIN lagged_days d2 ON (d1.event_day_first = d2.event_day_first AND d1.event_name = 'event_A' AND d2.event_name = 'event_B')
ORDER BY d1.event_day_first;

Output with the sample data:

    event_day_first     | Event_A | Event_B 
------------------------+---------+---------
 2013-04-24 00:00:00+08 |       2 |       1
 2013-04-25 00:00:00+08 |       4 |       1
 2013-04-26 00:00:00+08 |       3 |       0
 2013-04-27 00:00:00+08 |       2 |       1
(4 rows)

You can potentially make the query faster but much uglier by combining the three CTE clauses into a nested query using FROM (SELECT...) and wrapping them in a view instead of a CTE for use from the outer query. This will allow Pg to "push down" predicates into the queries, greatly reducing the data you have to work with when querying subsets of the data.

SQLFiddle doesn't seem to be working at the moment, but here's the demo setup I used:

CREATE TABLE Table1 
(id integer primary key, "event_date" timestamp not null, "event_name" text);

INSERT INTO Table1
("id", "event_date", "event_name")
VALUES
(101, '2013-04-24 18:33:37', 'event_A'),
(102, '2013-04-24 20:34:37', 'event_B'),
(103, '2013-04-24 20:40:37', 'event_A'),
(104, '2013-04-25 01:00:00', 'event_A'),
(105, '2013-04-25 12:00:15', 'event_A'),
(106, '2013-04-26 00:56:10', 'event_A'),
(107, '2013-04-27 12:00:15', 'event_A'),
(108, '2013-04-27 12:00:15', 'event_B');

I changed the ID of the last entry from 107 to 108, as I suspect that was just an error in your manual editing.

Here's how to express it as a view instead:

CREATE VIEW lagged_days AS
SELECT event_name, event_day AS event_day_first, sum(event_day_count) OVER w AS event_days_count 
FROM (
        SELECT event_names.event_name, gen_day, COALESCE(event_days.event_day_count,0)
        FROM generate_series( (SELECT min(event_date)::date FROM Table1), (SELECT max(event_date)::date FROM Table1), INTERVAL '1' DAY ) gen_day
        CROSS JOIN (SELECT DISTINCT event_name FROM Table1) event_names(event_name)
        LEFT OUTER JOIN (
                SELECT event_name, date_trunc('day', event_date), count(id)
                FROM Table1
                GROUP BY event_name, date_trunc('day', event_date)
                ORDER BY date_trunc('day', event_date), event_name
        ) event_days(event_name, event_day, event_day_count)
        ON (gen_day = event_days.event_day AND event_names.event_name = event_days.event_name)
) event_days_contiguous(event_name, event_day, event_day_count)
WINDOW w AS (PARTITION BY event_name ORDER BY event_day ROWS 1 PRECEDING);

You can then use the view in whatever crosstab queries you want to write. It'll work with the prior hand-crosstab query:

SELECT d1.event_day_first, d1.event_days_count AS "Event_A", d2.event_days_count AS "Event_B"
FROM lagged_days d1
INNER JOIN lagged_days d2 ON (d1.event_day_first = d2.event_day_first AND d1.event_name = 'event_A' AND d2.event_name = 'event_B')
ORDER BY d1.event_day_first;

... or using crosstab from the tablefunc extension, which I'll let you study up on.
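For reference, a rough sketch of what the crosstab version might look like against the view above. The column definition list is an assumption on my part (sum() over bigint counts yields numeric in PostgreSQL, and the view's day column is a timestamptz here), so check the types before relying on it:

    CREATE EXTENSION IF NOT EXISTS tablefunc;

    SELECT *
    FROM crosstab(
        $$ SELECT event_day_first, event_name, event_days_count
           FROM lagged_days
           ORDER BY 1, 2 $$,
        $$ VALUES ('event_A'), ('event_B') $$
    ) AS ct(event_day_first timestamptz, "Event_A" numeric, "Event_B" numeric)
    ORDER BY event_day_first;

The two-argument form of crosstab is used so that an event with no rows on a given day still lands in the correct output column.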

For a laugh, here's the explain on the above view-based query: http://explain.depesz.com/s/nvUq

Craig Ringer
  • thank you very much for the answer. Let me check it with my db. – Anant Jun 10 '13 at 05:39
  • @Anant Performance is likely to be pretty amazingly bad on large amounts of data. The view-based approach *might* perform better, but probably only if you have an index on `event_name` and possibly also `event_date` or `date_trunc('day', event_date)` – Craig Ringer Jun 10 '13 at 06:04