Count over rows in previous time range partitioned by a specific column

Question

My dataset consists of daily (actually business days) timeseries for different companies from different industries and I work with PostgreSQL. I have an indicator variable in my dataset taking values 1, -1 and most of the times 0. For better readability of the question I refer to days where the indicator variable is unequal to zero as indicator event.

So for all indicator events that are preceded by another indicator event for the same industry in the previous three business days, the indicator variable shall be updated to zero.

We can think of the following example dataset:

day              company    industry       indicator
2012-01-12       A          financial      1
2012-01-12       B          consumer       0 
2012-01-13       A          financial      1 
2012-01-13       B          consumer       -1
2012-01-16       A          financial      0 
2012-01-16       B          consumer       0 
2012-01-17       A          financial      0
2012-01-17       B          consumer       0
2012-01-17       C          consumer       0
2012-01-18       A          financial      0
2012-01-18       B          consumer       0
2012-01-18       C          consumer       1

So the indicator values that shall be updated to zero are on 2012-01-13 the entry for company A, and on 2012-01-18 the entry for company C, because they are preceded by another indicator event in the same industry within 3 business days.

I tried to accomplish it in the following way:

UPDATE test SET indicator = 0 
WHERE (day, industry) IN (
SELECT day, industry
  FROM (
       SELECT industry, day,
       COUNT(CASE WHEN indicator <> 0 THEN 1 END) 
          OVER (PARTITION BY industry ORDER BY day 
                ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) As cnt
       FROM test
       ) alias
  WHERE cnt >= 2)

My idea was to count the indicator events for the current day and the 3 preceding days partitioned by industry. If it counts more than 1, it updates the indicator value to zero.

The weak spot is, that so far it counts over the three preceding rows (partitioned by industry) instead of the three preceding business days. So in the example data, it is not able to update company C on 2012-01-18, because it counts over the last three rows where industry = consumer instead of counting over all rows where industry=consumer for the last three business days.

I tried different methods like adding another subquery in the third last line of the code or adding a WHERE EXISTS - clause after the third last line, to ensure that the code counts over the three preceding dates. But nothing worked. I really don't know out how to do that (I just learn to work with PostgreSQL).

Do you have any ideas how to fix it?

Or maybe I am thinking in a completely wrong direction and you know another approach how to solve my problem?

What happens if there is an indicator for the same industry every three business days? Do you reset every indicator but the first? Shouldn't you have a static grid instead? Like, "only pick the first incident from Mon. - Wed. and the first from Thu. - Fri.". — Erwin Brandstetter, Mar 06 '14 at 23:48
Yes exactly, in that case I would reset every indicator but the first. Therefore your solution is working perfectly, thank you very much. @ErwinBrandstetter — user3319629, Mar 07 '14 at 13:26

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

DB design

Fist off, your table should be normalized. industry should be a small foreign key column (typically integer) referencing industry_id of an industry table. Maybe you have that already and only simplified for the sake of the question. Your actual table definition would go a long way.

Since rows with an indicator are rare but highly interesting, create a (possibly "covering") partial index to make any solution faster:

CREATE INDEX tbl_indicator_idx ON tbl (industry, day)
WHERE  indicator <> 0;

Equality first, range last.
Assuming that indicator is defined NOT NULL. If industry was an integer, this index would be perfectly efficient.

Query

This query identifies rows to be reset:

WITH x AS (               -- only with indicator
   SELECT DISTINCT industry, day
   FROM   tbl t 
   WHERE  indicator <> 0
   )
SELECT industry, day
FROM  (
   SELECT i.industry, d.day, x.day IS NOT NULL AS incident
        , count(x.day) OVER (PARTITION BY industry ORDER BY day_nr
                             ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS ct
   FROM  (
      SELECT *, row_number() OVER (ORDER BY d.day) AS day_nr
      FROM  (
         SELECT generate_series(min(day), max(day), interval '1d')::date AS day
         FROM   x
         ) d
      WHERE  extract('ISODOW' FROM d.day) < 6
      ) d
   CROSS  JOIN (SELECT DISTINCT industry FROM x) i
   LEFT   JOIN x USING (industry, day)
   ) sub
WHERE  incident
AND    ct > 1
ORDER  BY 1, 2;

SQL Fiddle.

ISODOW as extract() parameter is convenient to truncate weekends.

Integrate this in your UPDATE:

WITH x AS (               -- only with indicator
   SELECT DISTINCT industry, day
   FROM   tbl t 
   WHERE  indicator <> 0
   )
UPDATE tbl t
SET    indicator = 0 
FROM  (
   SELECT i.industry, d.day, x.day IS NOT NULL AS incident
        , count(x.day) OVER (PARTITION BY industry ORDER BY day_nr
                             ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS ct
   FROM  (
      SELECT *, row_number() OVER (ORDER BY d.day) AS day_nr
      FROM  (
         SELECT generate_series(min(day), max(day), interval '1d')::date AS day
         FROM   x
         ) d
      WHERE  extract('isodow' FROM d.day) < 6
      ) d
   CROSS  JOIN (SELECT DISTINCT industry FROM x) i
   LEFT   JOIN x USING (industry, day)
   ) u
WHERE  u.incident
AND    u.ct > 1
AND    t.industry = u.industry
AND    t.day = u.day;

This should be substantially faster than your solution with correlated subqueries and a function call for every row. Even if that's based on my own previous answer, it's not perfect for this case.

Thank you very much for your great answer, and also for your advice how I can further improve my work. I really appreciate it!! @ErwinBrandstetter — user3319629, Mar 07 '14 at 13:25

score 0 · Answer 2 · edited May 23 '17 at 12:02

In the meantime I found one possible solution myself (I hope that this isn't against the etiquette of the forum).

Please note that this is only one possible solution. You are very welcome to comment it or to develop improvements if you want to.

For the first part, the function addbusinessdays which can add (or subtract) business day to a given date, I am referring to: http://osssmb.wordpress.com/2009/12/02/business-days-working-days-sql-for-postgres-2/ (I just slightly modified it because I don't care for holidays, just for weekends)

    CREATE OR REPLACE FUNCTION addbusinessdays(date, integer)
      RETURNS date AS
    $BODY$ 
    with alldates as (
        SELECT i,
        $1 + (i * case when $2 < 0 then -1 else 1 end) AS date
        FROM generate_series(0,(abs($2) + 5)*2) i
    ),
    days as (
        select i, date, extract('dow' from date) as dow
        from alldates
    ),
    businessdays as (
        select i, date, d.dow from days d
        where d.dow between 1 and 5
        order by i
    )

    select date from businessdays where
            case when $2 > 0 then date >=$1 when $2 < 0 then date <=$1 else date =$1 end
        limit 1
        offset abs($2)
    $BODY$
      LANGUAGE 'sql' VOLATILE
      COST 100;
    ALTER FUNCTION addbusinessdays(date, integer) OWNER TO postgres;

For the second part, I am referring to this related question, where I am applying Erwin Brandstetter's correlated subquery approach: Window Functions or Common Table Expressions: count previous rows within range

    UPDATE test SET indicator = 0 
    WHERE (day, industry) IN (
    SELECT day, industry
      FROM (
           SELECT industry, day,
                  (SELECT COUNT(CASE WHEN indicator <> 0 THEN 1 END) 
                   FROM test t1
                   WHERE t1.industry = t.industry
                   AND t1.day between addbusinessdays(t.day,-3) and t.day) As cnt
           FROM test t
           ) alias
      WHERE cnt >= 2)

Count over rows in previous time range partitioned by a specific column

2 Answers2

DB design

Query