
I have written a small PostgreSQL query that totals the number of jobs executed per hourly interval on every day between two given dates, e.g. all the jobs executed between February 2, 2012 and March 3, 2012, hour by hour, starting at the given hour on February 2 and ending at the given hour on March 3. I have noticed that this query doesn't print the rows with a 0 count (no job executed within that time interval, e.g. on February 21, 2012 between 5 and 6 pm). How can I make it also return the results (rows) with a 0 count? The code is as below:

SELECT date_trunc('hour', executiontime), count(executiontime)
  FROM mytable
 WHERE executiontime BETWEEN '2011-2-2 0:00:00' AND '2012-3-2 5:00:00' 
 GROUP BY date_trunc('hour', executiontime)
 ORDER BY date_trunc('hour', executiontime) ASC;

Thanks in advance.

  • You'll need a numbers table to generate rows for each hour, and then outer join to `mytable` to get the count. – dotjoe Aug 01 '12 at 14:19

2 Answers

        -- CTE to the rescue!!!
WITH cal AS (
        -- calendar: one row for every hour in the requested window
        SELECT generate_series('2012-02-02 00:00:00'::timestamp
                , '2012-03-02 05:00:00'::timestamp
                , '1 hour'::interval) AS stamp
        )
, qqq AS (
        -- the actual job counts, aggregated per hour
        SELECT date_trunc('hour', executiontime) AS stamp
        , count(*) AS zcount
        FROM mytable
        GROUP BY date_trunc('hour', executiontime)
        )
SELECT cal.stamp
        , COALESCE(qqq.zcount, 0) AS zcount -- hours absent from qqq get a 0
FROM cal
LEFT JOIN qqq ON cal.stamp = qqq.stamp
ORDER BY cal.stamp ASC
        ;
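If upgrading is not possible: the WITH clause only arrived in PostgreSQL 8.4 (which is what trips up the 8.3 server in the comments below), but the same calendar/LEFT JOIN shape can be written with plain derived-table subqueries. A sketch of that variant, using the integer form of generate_series on the assumption that the timestamp form is likewise an 8.4 addition (701 is the number of hourly steps from 2012-02-02 00:00 to 2012-03-02 05:00, 2012 being a leap year):

        -- same shape, no CTE: should work on pre-8.4 servers
        SELECT cal.stamp
                , COALESCE(qqq.zcount, 0) AS zcount
        FROM (SELECT timestamp '2012-02-02 00:00:00'
                        + s.i * interval '1 hour' AS stamp
                FROM generate_series(0, 701) AS s(i)
                ) AS cal
        LEFT JOIN (SELECT date_trunc('hour', executiontime) AS stamp
                        , count(*) AS zcount
                FROM mytable
                GROUP BY date_trunc('hour', executiontime)
                ) AS qqq
                ON cal.stamp = qqq.stamp
        ORDER BY cal.stamp;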
wildplasser
  • This gives the following syntax error: `ERROR: syntax error at or near "WITH cal" LINE 8: WITH cal AS (`, despite being the same as the CTE examples in the PostgreSQL documentation, for some reason. – sm90901 Aug 02 '12 at 05:21
  • I have 8.3; that should be the problem. Thanks. – sm90901 Aug 02 '12 at 08:08
  • The standard trick is to replace the CTEs with (temporary) views and refer to those. But it will lose a lot of its elegance that way. BTW: I would suggest upgrading to 9.1.x. It's only an hour of work or so, and it is worth it. – wildplasser Aug 02 '12 at 08:45
  • Then I'll install 9.1.x and try out the answer. By the way, is this suitable for very large-scale data? I'm looking at ~2.5 million rows in total, which come to ~10.5k result rows (and this number may increase) if I retrieve them hour by hour for every single day. My initial code retrieves the entire data set (a slightly longer timespan than the datetime parameters given in my question; the timestamps are parametric, and any two intervals can be chosen via user input from the Qt front-end) in about 5 seconds. – sm90901 Aug 02 '12 at 10:14
  • Don't forget to back up (you need to restore from a backup to do a migration!). Performance is not an issue; the planner/optimiser treats views and CTEs as part of the main query, and reshuffles them happily, as if they were ordinary subqueries. With the correct structure and tuning, subsecond timing would be possible for (MegaRow * KiloRow) joins. BTW: 1 MRow is not large; that amount of data will still fit in the cache, so a typical query in a "warm" database would need virtually no disk I/O. – wildplasser Aug 02 '12 at 10:17
  • I tried your query and it worked marvellously. Yet I have a question: it now returns the entire data set (from Feb 2, 2011 to Apr 1, 2012) in about 21 seconds, and the performance of all the queries I tried has decreased slightly. I'm not complaining about your solution, but can you give me some query-tuning directions for this answer? I'm just learning the concept of tuning; the topic really caught my interest, and I would like to practice it more in order to learn it. – sm90901 Aug 02 '12 at 13:01
  • http://www.varlena.com/GeneralBits/Tidbits/perf.html The essence of all tuning is memory vs disk. {effective_cache_size, work_mem, random_page_cost} are the place to start. Put `EXPLAIN ANALYZE` in front of your query to see what it attempts to do. (Maybe you should start a new question for a specific tuning problem.) Please note that your specific query *is* expensive, since it aggregates and needs *all* the tuples to do that. – wildplasser Aug 02 '12 at 13:25
  • Thanks, I'll have a look at this tonight and ask any possible questions as a separate thread. – sm90901 Aug 02 '12 at 13:27
  • Here is a nice report on experiments with tuning: http://stackoverflow.com/a/10657906/905902 (shameless self-plug ;-) – wildplasser Aug 02 '12 at 13:42
  • Thanks for the additional info. By the way, I'm not sure if I should open another question, but the code only works if the passed timespan parameters are round hourly values like 17:00:00. I should be able to process inputs like 02:21:16 or 19:15:59, and the row counts should be consistent with them, e.g. if the user gave 02:21:16 as the starting time, the 02:00:00–03:00:00 hour interval should only return the count of queries between 02:21:16 and 03:00:00 instead of from 02:00:00. How can I fix that one? I think the solution lies somewhere within the qqq AS block, but I'm not sure. – sm90901 Aug 03 '12 at 10:06
  • And the last interval on the ending day should only cover 19:00:00 to 19:15:59; forgot to add that one. – sm90901 Aug 03 '12 at 10:12
  • Yes, you'd have to add a `WHERE executiontime >= '2012-02-02 02:21:16'` clause in the qqq leg. You'll also have to alter the start value for the series, but this could also be done by putting a WHERE in the final query. (Both starting values would need to be rounded down modulo 'hour'.) BTW: IIRC, 8.3 still stored dates in floating-point types by default, so there might be a loss of resolution in the older data. – wildplasser Aug 03 '12 at 10:29
  • All right, I added `WHERE executiontime between '2012-02-02 02:11:16' and '2012-02-02 21:11:33'` to the `qqq AS` part and added `WHERE cal.stamp between '2012-02-02 02:11:16' and '2012-02-02 21:11:33'` to the final query part, right under the left join. Now it's working correctly, with one nuisance: the 02:00:00–03:00:00 interval is not displayed; it starts showing data from 03:00:00 onward. What may be the cause here? – sm90901 Aug 03 '12 at 10:57
  • I also changed the SELECT in `WITH cal AS` to `SELECT generate_series('2012-02-02 02:00:00'::timestamp , '2012-02-02 22:00:00'::timestamp , '1 hour'::interval) AS stamp` – sm90901 Aug 03 '12 at 10:58
  • I think I got it: adjusting only the executiontime values in the qqq subquery now yields exactly what I want, and cal.stamp covers the time interval I gave for executiontime (a sketch follows below). I tested the same inputs with my original query that started this 0-count problem, and the results are consistent. Thanks again for all your help. – sm90901 Aug 03 '12 at 11:08
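Pulling the boundary discussion above together, here is a sketch of what the comments converge on, using the commenters' example window of 2012-02-02 02:11:16 to 2012-02-02 21:11:33. The exact sub-hour boundaries go into the qqq leg only; the calendar covers the whole hours that overlap the window (both ends rounded down to the hour), and cal.stamp is deliberately not filtered with the exact timestamps, since that is what hid the 02:00:00 row above:

        WITH cal AS (
                -- whole hours overlapping the window: 02:00 .. 21:00
                SELECT generate_series('2012-02-02 02:00:00'::timestamp
                        , '2012-02-02 21:00:00'::timestamp
                        , '1 hour'::interval) AS stamp
                )
        , qqq AS (
                SELECT date_trunc('hour', executiontime) AS stamp
                , count(*) AS zcount
                FROM mytable
                -- exact boundaries live here, so the first and last
                -- buckets only count jobs inside the requested window
                WHERE executiontime BETWEEN '2012-02-02 02:11:16'
                        AND '2012-02-02 21:11:33'
                GROUP BY date_trunc('hour', executiontime)
                )
        SELECT cal.stamp
                , COALESCE(qqq.zcount, 0) AS zcount
        FROM cal
        LEFT JOIN qqq ON cal.stamp = qqq.stamp
        ORDER BY cal.stamp;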

Look at this. The idea is to generate an array or table with the dates in this period and join it with the job execution table.
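A minimal sketch of that idea, assuming the same `mytable`/`executiontime` schema as in the question: generate one row per hour and LEFT JOIN the jobs onto it. Counting the joined column rather than * keeps empty hours at 0, because count() ignores the NULLs the outer join produces:

        SELECT h.stamp, count(m.executiontime) AS zcount
        FROM generate_series('2012-02-02 00:00:00'::timestamp
                , '2012-03-02 05:00:00'::timestamp
                , '1 hour') AS h(stamp)
        LEFT JOIN mytable m
                ON date_trunc('hour', m.executiontime) = h.stamp
        GROUP BY h.stamp
        ORDER BY h.stamp;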

pawos