1.) Your query did not catch all overlaps - the other answers have fixed that already.
2.) The data type of your columns starttime and endtime is timestamp. So your WHERE clause is slightly wrong, too:
BETWEEN '2011-11-02' AND '2011-11-03'
This would include '2011-11-03 00:00'. The upper bound has to be excluded.
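You can see the pitfall directly, since BETWEEN includes both endpoints:
SELECT timestamp '2011-11-03 00:00'
       BETWEEN timestamp '2011-11-02' AND timestamp '2011-11-03'; -- returns true
Use >= for the lower and < for the upper bound instead, as in the queries below.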
3.) Removed the mixed-case syntax without double-quotes. Unquoted identifiers are folded to lower case automatically. To put it simply: best not to use mixed-case identifiers in PostgreSQL at all.
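A quick illustration of the case folding, with a made-up table:
CREATE TABLE Foo (Bar int); -- actually creates foo(bar)
SELECT bar FROM foo;        -- works
SELECT "Bar" FROM foo;      -- ERROR: column "Bar" does not exist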
4.) Transformed the query to use an explicit JOIN, which is always preferable. Actually, I made it a LEFT [OUTER] JOIN, because I want to count calls that overlap with no other calls, too.
5.) Simplified the syntax a bit to arrive at this base query:
SELECT t1.sid, count(*) AS ct
FROM   calls_nov t1
LEFT   JOIN calls_nov t2 ON t1.starttime <= t2.endtime
                        AND t1.endtime >= t2.starttime
WHERE  t1.starttime >= '2011-11-02 0:0'::timestamp
AND    t1.starttime <  '2011-11-03 0:0'::timestamp
GROUP  BY 1
ORDER  BY 2 DESC;
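The join condition is the standard interval-overlap test: two intervals overlap exactly when each one starts no later than the other ends. A quick sanity check with made-up values:
SELECT timestamp '2011-11-02 10:00' <= timestamp '2011-11-02 12:00'
   AND timestamp '2011-11-02 11:00' >= timestamp '2011-11-02 10:30' AS overlap; -- true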
This query is extremely slow for a big table, because every row starting on '2011-11-02' has to be compared to every row in the whole table, which leads to (almost) O(n²) cost. If, say, 10,000 calls start on that day in a table of 350,000 rows, that is already 3.5 billion row comparisons.
Faster
We can drastically cut down the cost by pre-selecting possible candidates and only selecting the columns and rows we need. I do this with two CTEs.
- Select calls starting on the day in question. -> CTE x
- Calculate the latest end of those calls. (subquery in CTE y)
- Select only calls that overlap with the total range of CTE x. -> CTE y
- The final query is much faster than querying the huge underlying table.
WITH x AS (
   SELECT sid, starttime, endtime
   FROM   calls_nov
   WHERE  starttime >= '2011-11-02 0:0'
   AND    starttime <  '2011-11-03 0:0'
   ), y AS (
   SELECT starttime, endtime
   FROM   calls_nov
   WHERE  endtime >= '2011-11-02 0:0'
   AND    starttime <= (SELECT max(endtime) FROM x)
   )
SELECT x.sid, count(*) AS count_overlaps
FROM   x
LEFT   JOIN y ON x.starttime <= y.endtime
             AND x.endtime >= y.starttime
GROUP  BY 1
ORDER  BY 2 DESC;
Faster yet
I have a real-life table of 350,000 rows with overlapping start/end timestamps, similar to yours. I used that for a quick benchmark: PostgreSQL 8.4, scarce resources because it is a test DB. Indexes on starttime and endtime. (An index on the ID column is irrelevant here.) Tested with EXPLAIN ANALYZE, best of 5.
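The two single-column indexes could look like this (the names are illustrative; the CLUSTER command at the end assumes starttime_index):
CREATE INDEX starttime_index ON calls_nov (starttime);
CREATE INDEX endtime_index   ON calls_nov (endtime);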
Base query:
Total runtime: 476994.774 ms
CTE variant:
Total runtime: 4199.788 ms -- that's > factor 100.
After adding a multicolumn index of the form:
CREATE INDEX start_end_index on calls_nov (starttime, endtime);
Total runtime: 4159.367 ms
Ultimate Speed
If that is not enough, there is a way to speed it up by yet another order of magnitude. Instead of the CTEs above, write the two pre-selections to temp tables and - this is the crucial point - create an index on the second one. It could look like this:
Execute as one transaction (BEGIN; ... COMMIT;), because the ON COMMIT DROP tables vanish at commit:
CREATE TEMP TABLE x ON COMMIT DROP AS
   SELECT sid, starttime, endtime
   FROM   calls_nov
   WHERE  starttime >= '2011-11-02 0:0'
   AND    starttime <  '2011-11-03 0:0';

CREATE TEMP TABLE y ON COMMIT DROP AS
   SELECT starttime, endtime
   FROM   calls_nov
   WHERE  endtime >= '2011-11-02 0:0'
   AND    starttime <= (SELECT max(endtime) FROM x);

CREATE INDEX y_idx ON y (starttime, endtime); -- this is where the magic happens:
                                              -- the join below can use index scans on y

SELECT x.sid, count(*) AS ct
FROM   x
LEFT   JOIN y ON x.starttime <= y.endtime
             AND x.endtime >= y.starttime
GROUP  BY 1
ORDER  BY 2 DESC;
Read about temporary tables in the manual.
Ultimate solution
Create a plpgsql function that encapsulates the magic.
Diagnose the typical size of your temp tables. Create them standalone and measure:
SELECT pg_size_pretty(pg_total_relation_size('tmp_tbl'));
If they are bigger than your setting for temp_buffers, temporarily set it high enough in your function to hold both temporary tables in RAM. It is a major speedup if you don't have to spill to disk. (The setting must be applied before the first use of temp tables in the session to have any effect.)
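To check your current setting first (the default is typically a modest 8MB):
SHOW temp_buffers;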
CREATE OR REPLACE FUNCTION f_call_overlaps(date)
RETURNS TABLE (sid varchar, ct integer) AS
$BODY$
DECLARE
_from timestamp := $1::timestamp;
_to   timestamp := ($1 + 1)::timestamp; -- date + 1 = next day
BEGIN
SET temp_buffers = '64MB'; -- example value; more RAM for temp tables
CREATE TEMP TABLE x ON COMMIT DROP AS
SELECT c.sid, starttime, endtime -- avoid naming conflict with OUT param
FROM calls_nov c
WHERE starttime >= _from
AND starttime < _to;
CREATE TEMP TABLE y ON COMMIT DROP AS
SELECT starttime, endtime
FROM calls_nov
WHERE endtime >= _from
AND starttime <= (SELECT max(endtime) FROM x);
CREATE INDEX y_idx ON y (starttime, endtime);
RETURN QUERY
SELECT x.sid, count(*)::int -- AS ct
FROM x
LEFT JOIN y ON x.starttime <= y.endtime AND x.endtime >= y.starttime
GROUP BY 1
ORDER BY 2 DESC;
END;
$BODY$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_call_overlaps('2011-11-02'); -- just name your date
Total runtime: 138.169 ms -- that's factor 3000
What else can you do to speed it up?
General performance optimization.
CLUSTER calls_nov USING starttime_index; -- this also rewrites the table, removing dead rows (like VACUUM FULL)
ANALYZE calls_nov;