25

I have unevenly distributed data (wrt date) for a few years (2003-2008). I want to query data for a given set of start and end date, grouping the data by any of the supported intervals (day, week, month, quarter, year) in PostgreSQL 8.3.

The problem is that some of the queries give results continuous over the required period, as this one:

select to_char(date_trunc('month',date), 'YYYY-MM-DD'), count(distinct post_id) 
from some_table
where category_id = 1
and entity_id = 77
and entity2_id = 115 
and date <= '2008-12-06'
and date >= '2007-12-01'
group by date_trunc('month',date)
order by date_trunc('month',date);

  to_char   | count 
------------+-------
 2007-12-01 |    64
 2008-01-01 |    31
 2008-02-01 |    14
 2008-03-01 |    21
 2008-04-01 |    28
 2008-05-01 |    44
 2008-06-01 |   100
 2008-07-01 |    72
 2008-08-01 |    91
 2008-09-01 |    92
 2008-10-01 |    79
 2008-11-01 |    65
(12 rows)

But some of them miss some intervals because there is no data present, as this one:

select to_char(date_trunc('month',date), 'YYYY-MM-DD'), count(distinct post_id) 
from some_table
where category_id=1
and entity_id = 75
and entity2_id = 115 
and date <= '2008-12-06'
and date >= '2007-12-01'
group by date_trunc('month',date)
order by date_trunc('month',date);
    
  to_char   | count 
------------+-------

 2007-12-01 |     2
 2008-01-01 |     2
 2008-03-01 |     1
 2008-04-01 |     2
 2008-06-01 |     1
 2008-08-01 |     3
 2008-10-01 |     2
(7 rows)

where the required resultset is:

  to_char   | count 
------------+-------
 2007-12-01 |     2
 2008-01-01 |     2
 2008-02-01 |     0
 2008-03-01 |     1
 2008-04-01 |     2
 2008-05-01 |     0
 2008-06-01 |     1
 2008-07-01 |     0
 2008-08-01 |     3
 2008-09-01 |     0
 2008-10-01 |     2
 2008-11-01 |     0
(12 rows)

A count of 0 for missing entries.

I have seen earlier discussions on Stack Overflow but they don't solve my problem it seems, since my grouping period is one of (day, week, month, quarter, year) and decided on runtime by the application. So an approach like left join with a calendar table or sequence table will not help I guess.

My current solution to this is to fill in these gaps in Python (in a Turbogears App) using the calendar module.

Is there a better way to do this?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
JV.
  • 2,658
  • 4
  • 24
  • 36

3 Answers3

36

This question is old. But since fellow users picked it as master for a new duplicate I am adding a proper answer.

Proper solution

SELECT *
FROM  (
   SELECT day::date
   FROM   generate_series(timestamp '2007-12-01'
                        , timestamp '2008-12-01'
                        , interval  '1 month') day
   ) d
LEFT   JOIN (
   SELECT date_trunc('month', date_col)::date AS day
        , count(*) AS some_count
   FROM   tbl
   WHERE  date_col >= date '2007-12-01'
   AND    date_col <= date '2008-12-06'
-- AND    ... more conditions
   GROUP  BY 1
   ) t USING (day)
ORDER  BY day;

Use LEFT JOIN, of course.

generate_series() can produce a table of timestamps on the fly, and very fast. See:

It's typically faster to aggregate before you join. Related answer with test case in a fiddle:

Cast the timestamp to date (::date) for a basic format. For more use to_char().

GROUP BY 1 is syntax shorthand to reference the first output column. Could be GROUP BY day as well, but that might conflict with an existing column of the same name. Or GROUP BY date_trunc('month', date_col)::date but that's too long for my taste.

Works with the available interval arguments for date_trunc().

count() never returns NULL - 0 for no rows - but the LEFT JOIN does.
To return 0 instead of NULL in the outer SELECT, use COALESCE(some_count, 0) AS some_count. The manual.

For a more generic solution or arbitrary time intervals see:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • is it possible to accomplish this with CTEs? – zam6ak Jul 11 '16 at 17:14
  • @zam6ak: Sure, in various ways. But why would you want to? A CTE would be slower. Use CTEs only when you need them in Postgres. If you have a use case, I suggest asking a new question. You can always link to this answer for context and drop a comment here which links to the related question and notifies me. – Erwin Brandstetter Jul 12 '16 at 01:01
  • Thanks, I posted a new [question](http://stackoverflow.com/questions/38332433/pg-how-to-include-missing-data-for-multiple-groupings-including-time-span) – zam6ak Jul 12 '16 at 14:59
  • This query is exactly what I was looking for, thank you for the answer. Is there a way to have some_count return 0 and not empty data? – Paul Nov 23 '16 at 16:04
  • @Paul: I added instructions in the answer. – Erwin Brandstetter Nov 23 '16 at 16:34
  • @ErwinBrandstetter I am using `COALESCE(count(*), 0) AS click_count` however it is still an empty value. Am I missing something? – Paul Nov 23 '16 at 18:47
  • 1
    @Paul: Yes, you are missing the essential part. `COALESCE(count(*), 0)` *never* makes sense. I wrote: *in the outer `SELECT` use `COALESCE(some_count, 0) AS some_count`*. `count(*)` happens in the subquery before the `LEFT JOIN`. – Erwin Brandstetter Nov 23 '16 at 18:58
21

You can create the list of all first days of the last year (say) with

select distinct date_trunc('month', (current_date - offs)) as date 
from generate_series(0,365,28) as offs;
          date
------------------------
 2007-12-01 00:00:00+01
 2008-01-01 00:00:00+01
 2008-02-01 00:00:00+01
 2008-03-01 00:00:00+01
 2008-04-01 00:00:00+02
 2008-05-01 00:00:00+02
 2008-06-01 00:00:00+02
 2008-07-01 00:00:00+02
 2008-08-01 00:00:00+02
 2008-09-01 00:00:00+02
 2008-10-01 00:00:00+02
 2008-11-01 00:00:00+01
 2008-12-01 00:00:00+01

Then you can join with that series.

Martin v. Löwis
  • 124,830
  • 17
  • 198
  • 235
  • I tried this : http://pastebin.com/f6f44e58b still 7 rows. Is there something wrong with the join – JV. Dec 06 '08 at 12:45
  • Though I think the (start, stop , step) have to be carefully filled before doing the query and i skeptic that for a long enough period the step of 28 can fail, what do you think? – JV. Dec 06 '08 at 13:30
  • 1
    Stepping in steps of 28 days will always hit every month at least once, since the shortest months is 28 days; feel free to use any smaller step, though. It will hit the same month twice occasionally, hence the DISTINCT filter. – Martin v. Löwis Dec 06 '08 at 13:50
  • I wish you hadn't deleted the pastebin for the correct answer :( – slashdottir Feb 04 '17 at 05:25
0

You could create a temporary table at runtime and left join on that. That seems to make the most sense.

ng.mangine
  • 2,947
  • 1
  • 17
  • 7