32

I have the following table in my Postgresql 9.1 database:

select * from ro;
date       |  shop_id | amount 
-----------+----------+--------
2013-02-07 |     1001 |      3
2013-01-31 |     1001 |      2
2013-01-24 |     1001 |      1
2013-01-17 |     1001 |      5
2013-02-10 |     1001 |     10
2013-02-03 |     1001 |      4
2012-12-27 |     1001 |      6
2012-12-20 |     1001 |      8
2012-12-13 |     1001 |      4
2012-12-06 |     1001 |      3
2012-10-29 |     1001 |      3

I am trying to get a moving average comparing data against last 3 Thursdays without including the current Thursday. Here's my query:

select date, shop_id, amount, extract(dow from date),
avg(amount) OVER (PARTITION BY extract(dow from date) ORDER BY date DESC
                      ROWS BETWEEN 0 PRECEDING AND 2 FOLLOWING)                          
from ro
where extract(dow from date) = 4

This is the result given

date       |  shop_id | amount | date_part |        avg         
-----------+----------+--------+-----------+--------------------
2013-02-07 |     1001 |      3 |         4 | 2.0000000000000000
2013-01-31 |     1001 |      2 |         4 | 2.6666666666666667
2013-01-24 |     1001 |      1 |         4 | 4.0000000000000000
2013-01-17 |     1001 |      5 |         4 | 6.3333333333333333
2012-12-27 |     1001 |      6 |         4 | 6.0000000000000000
2012-12-20 |     1001 |      8 |         4 | 5.0000000000000000
2012-12-13 |     1001 |      4 |         4 | 3.5000000000000000
2012-12-06 |     1001 |      3 |         4 | 3.0000000000000000

I expect

date       |  shop_id | amount | date_part |        avg         
-----------+----------+--------+-----------+--------------------
2013-02-07 |     1001 |      3 |         4 | 2.6666666666666667
2013-01-31 |     1001 |      2 |         4 | 4.0000000000000000
2013-01-24 |     1001 |      1 |         4 | 6.3333333333333333
2013-01-17 |     1001 |      5 |         4 | 6.0000000000000000
2012-12-27 |     1001 |      6 |         4 | 5.0000000000000000
2012-12-20 |     1001 |      8 |         4 |
2012-12-13 |     1001 |      4 |         4 |
2012-12-06 |     1001 |      3 |         4 |
Glicious
  • 421
  • 1
  • 5
  • 13
  • +1 good question - Pg version, sample data, expected results. Thanks! Converted to SQLFiddle here: http://sqlfiddle.com/#!1/18891/1 – Craig Ringer Feb 07 '13 at 11:06
  • 5
    BTW, "date" is a terrible column name, since it's the name of a data type. Avoid using it. If you must use it, always qualify it with the table alias and double quote it, as shown here: http://sqlfiddle.com/#!1/18891/4 – Craig Ringer Feb 07 '13 at 11:08
  • Thanks Craig :) This is just a sample dataset from a very large table. I just like to get the query right first. – Glicious Feb 07 '13 at 12:00

2 Answers2

21

SQL Fiddle

select
    "date",
    shop_id,
    amount,
    extract(dow from date),
    case when
        row_number() over (order by date) > 3
        then
            avg(amount) OVER (
                ORDER BY date DESC
                ROWS BETWEEN 1 following AND 3 FOLLOWING
            )
        else null end
from (
    select *
    from ro
    where extract(dow from date) = 4
) s

What is wrong with the OP's query is the frame specification:

ROWS BETWEEN 0 PRECEDING AND 2 FOLLOWING

Other than that my query avoids unneeded computing by filtering Thursdays before applying the expensive window functions.

If it is necessary to partition by shop_id then obviously add the partition by shop_id to both functions, avg and row_number.

Clodoaldo Neto
  • 118,695
  • 26
  • 233
  • 260
  • 1
    While that seems to behave fine, it might be worth explaining what was wrong with the old one; why it failed. That'll help the OP and others learn, not just fix an immediate problem – Craig Ringer Feb 07 '13 at 12:16
  • Thanks Clodoalo :) It appears to me that the windowing function that I used ROWS BETWEEN 0 PRECEDING AND 2 FOLLOWING should have been ROWS BETWEEN 1 following AND 3 FOLLOWING Makes perfect sense! I'll try this on my larger dataset in a few hours and advise :) Thanks again! – Glicious Feb 07 '13 at 12:21
  • What happens if there's a Thursday skipped in the data? will the rolling avg include data from an unintentionally larger window (because the missing Thursday's data is implicitly 0) – Joey Baruch Jan 04 '22 at 23:23
13

I think a better answer might be:

SELECT date, shop_id, amount, 
    extract(dow from date) AS dow,
    CASE WHEN count(amount) OVER w = 3 
        THEN avg(amount) OVER w END AS average_amt             
FROM ro
WHERE extract(dow from date) = 4 
WINDOW w AS (ORDER BY date DESC ROWS BETWEEN 1 FOLLOWING AND 3 FOLLOWING)

I think it is cleaner to use the same window for both checking the number of rows in the window and taking the average. (This also saves two window aggregations, as can be seen in the original answer.)

Regarding the claim in the earlier answer that "my query avoids unneeded computing by filtering Thursdays before applying the expensive window functions", this also applies to the query suggested by the OP and to my query, as appending EXPLAIN to either shows.

Ian Gow
  • 3,098
  • 1
  • 25
  • 31
  • This was such an old post, but thank you for your suggested solution. :-) – Glicious Jul 17 '16 at 11:34
  • Good solution. The SQL is clear, but for readers that like citations, [this is from Jul 05 '17](https://www.compose.com/articles/metrics-maven-calculating-a-moving-average-in-postgresql/) (older than this post). The technic can be named "moving window".. A [good tutorial about how to specify your windows](https://momjian.us/main/writings/pgsql/window.pdf). – Peter Krauss Dec 16 '19 at 03:05
  • What happens if there's a Thursday skipped in the data? will the rolling avg include data from an unintentionally larger window (because the missing Thursday's data is implicitly 0) – Joey Baruch Jan 04 '22 at 23:23