
I have a table that holds phone calls, with the following fields:

  • ID
  • STARTTIME
  • ENDTIME
  • STATUS
  • CALL_FROM
  • CALL_TO

There are 2.9 million records loaded into a local PostgreSQL database. I added indexes on ID (a unique index), starttime, and endtime.

Searching on Stack Overflow, I found some useful SQL and modified it to what I think should logically work. The problem is that the query runs for many hours and never returns:

SELECT T1.sid, count(*) as CountSimultaneous
FROM calls_nov T1, calls_nov T2
WHERE
     T1.StartTime between T2.StartTime and T2.EndTime
     and T1.StartTime between '2011-11-02' and '2011-11-03'
GROUP BY
     T1.sid
ORDER BY CountSimultaneous DESC;

Can someone please either suggest a way to fix the query/index so that it actually works or suggest another way to calculate concurrent calls?

EDIT:

Explain plan:

Sort  (cost=11796758237.81..11796758679.47 rows=176663 width=35)
  Sort Key: (count(*))
  ->  GroupAggregate  (cost=0.00..11796738007.56 rows=176663 width=35)
        ->  Nested Loop  (cost=0.00..11511290152.45 rows=57089217697 width=35)

Table creation script:

CREATE TABLE calls_nov (
  sid varchar,
  starttime timestamp, 
  endtime timestamp, 
  call_to varchar, 
  call_from varchar, 
  status varchar);

Index creation:

CREATE UNIQUE INDEX sid_unique_index on calls_nov (sid);

CREATE INDEX starttime_index on calls_nov (starttime);

CREATE INDEX endtime_index on calls_nov (endtime);
– Sologoub
  • T1 and T2 are the same?? – fge Jan 04 '12 at 20:47
  • Can you provide the explain plan? http://www.postgresql.org/docs/8.1/static/sql-explain.html Also, assuming that "sid" is the ID, including it in the select and grouping by it doesn't make sense - the "count" would always be 1. – Neville Kuyt Jan 04 '12 at 20:48
  • @fge - Of course they are...it's a log of calls. He wants to know how many simultaneous calls were also happening during each call. – Eric Jan 04 '12 at 20:48
  • SID is the unique ID of each call. – Sologoub Jan 04 '12 at 20:49
  • In that case, what happens if you do: SELECT count(*) as CountSimultaneous FROM calls_nov T1, calls_nov T2 WHERE T1.StartTime between T2.StartTime and T2.EndTime and T1.StartTime between '2011-11-02' and '2011-11-03' – Neville Kuyt Jan 04 '12 at 20:51
  • @NevilleK this is the result of explain plan "Aggregate (cost=11144150221.35..11144150221.36 rows=1 width=0)" – Sologoub Jan 04 '12 at 21:01
  • Looks like it's not using the indexes - can you also post the table creation scripts? – Neville Kuyt Jan 04 '12 at 21:05
  • Added create table and index scripts. Thank you! – Sologoub Jan 04 '12 at 21:14
  • I'd try a compound index on start and end time - looks like it's not using the individual keys. – Neville Kuyt Jan 04 '12 at 21:17
  • Please don't use implicit join syntax (comma-separated list after `FROM`), as it's too simple to get into cross-joins, or other things (I believe it may be out-of-standard now, but included for backwards compatibility). Always use the explicit syntax, as in @Eric's answer. – Clockwork-Muse Jan 04 '12 at 21:26
  • agreed on the implicit joins. Also added index for starttime, endtime, but no gain in performance. – Sologoub Jan 04 '12 at 21:34

4 Answers


1.) Your query did not catch all overlaps - the other answers have fixed this already.

2.) The data type of your columns starttime and endtime is timestamp. So your WHERE clause is slightly wrong, too:

BETWEEN '2011-11-02' AND '2011-11-03'

This would also include calls starting exactly at '2011-11-03 00:00'. The upper bound has to be excluded.

3.) Removed the mixed-case syntax without double quotes. Unquoted identifiers are folded to lower case automatically. To put it simply: best not to use mixed-case identifiers at all in PostgreSQL.
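
For illustration, using the calls_nov table above (its columns were created unquoted, so they are stored in lower case):

SELECT StartTime FROM calls_nov;   -- works: folded to starttime
SELECT "StartTime" FROM calls_nov; -- ERROR: column "StartTime" does not exist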

4.) Transformed the query to use explicit JOIN syntax, which is always preferable. Actually, I made it a LEFT [OUTER] JOIN, because I want to count calls that overlap with no other calls, too.

5.) Simplified the syntax a bit to arrive at this base query:

SELECT t1.sid, count(*) AS ct
FROM   calls_nov t1
LEFT   JOIN calls_nov t2 ON t1.starttime <= t2.endtime
                        AND t1.endtime >= t2.starttime
WHERE  t1.starttime >= '2011-11-02 0:0'::timestamp
AND    t1.starttime <  '2011-11-03 0:0'::timestamp
GROUP  BY 1
ORDER  BY 2 DESC;

This query is extremely slow for a big table, because every row starting on '2011-11-02' has to be compared to every row in the whole table, which leads to (almost) O(n²) cost.


Faster

We can drastically cut down the cost by pre-selecting possible candidates. Only select the columns and rows you need. I do this with two CTEs.

  1. Select calls starting on the day in question. -> CTE x
  2. Calculate the latest end of those calls. (subquery in CTE y)
  3. Select only calls that overlap with the total range of CTE x. -> CTE y
  4. The final query is much faster than querying the huge underlying table.
WITH x AS (
    SELECT sid, starttime, endtime
    FROM   calls_nov
    WHERE  starttime >= '2011-11-02 0:0'
    AND    starttime <  '2011-11-03 0:0'
    ), y AS (
    SELECT starttime, endtime
    FROM   calls_nov
    WHERE  endtime >= '2011-11-02 0:0'
    AND    starttime <= (SELECT max(endtime) As max_endtime FROM x)
    )
SELECT x.sid, count(*) AS count_overlaps
FROM   x
LEFT   JOIN y ON x.starttime <= y.endtime
             AND x.endtime >= y.starttime
GROUP  BY 1
ORDER  BY 2 DESC;

Faster yet

I have a real-life table of 350,000 rows with overlapping start/end timestamps similar to yours, which I used for a quick benchmark. PostgreSQL 8.4, scarce resources because it is a test DB. Indexes on starttime and endtime. (The index on the ID column is irrelevant here.) Tested with EXPLAIN ANALYZE, best of 5.

Base query:
Total runtime: 476994.774 ms

CTE variant:
Total runtime: 4199.788 ms -- that's > factor 100.

After adding a multicolumn index of the form:

CREATE INDEX start_end_index on calls_nov (starttime, endtime);

Total runtime: 4159.367 ms


Ultimate Speed

If that is not enough, there is a way to speed it up by yet another order of magnitude. Instead of the CTEs above, materialize temp tables and - this is the crucial point - create an index on the second one. It could look like this:

Execute as one transaction:

CREATE TEMP TABLE x ON COMMIT DROP AS   
    SELECT sid, starttime, endtime
    FROM   calls_nov
    WHERE  starttime >= '2011-11-02 0:0'
    AND    starttime <  '2011-11-03 0:0';

CREATE TEMP TABLE y ON COMMIT DROP AS
    SELECT starttime, endtime
    FROM   calls_nov
    WHERE  endtime >= '2011-11-02 0:0'
    AND    starttime <= (SELECT max(endtime) FROM x);

CREATE INDEX y_idx ON y (starttime, endtime); -- this is where the magic happens

SELECT x.sid, count(*) AS ct
FROM   x
LEFT   JOIN y ON x.starttime <= y.endtime
             AND x.endtime >= y.starttime
GROUP  BY 1
ORDER  BY 2 DESC;

Read about temporary tables in the manual.


Ultimate solution

  • Create a plpgsql function that encapsulates the magic.

  • Diagnose the typical size of your temp tables. Create them standalone and measure:

      SELECT pg_size_pretty(pg_total_relation_size('tmp_tbl'));
    
  • If they are bigger than your setting for temp_buffers, then temporarily set it high enough in your function to hold both temporary tables in RAM. It is a major speedup if you don't have to spill to disk. (This must happen before the first use of temp tables in the session to have an effect.)

CREATE OR REPLACE FUNCTION f_call_overlaps(date)
  RETURNS TABLE (sid varchar, ct integer) AS
$BODY$
DECLARE
    _from timestamp := $1::timestamp;
    _to   timestamp := ($1 +1)::timestamp;
BEGIN

SET temp_buffers = '64MB'; -- example value: more RAM for temp tables

CREATE TEMP TABLE x ON COMMIT DROP AS   
    SELECT c.sid, starttime, endtime  -- avoid naming conflict with OUT param
    FROM   calls_nov c
    WHERE  starttime >= _from
    AND    starttime <  _to;

CREATE TEMP TABLE y ON COMMIT DROP AS
    SELECT starttime, endtime
    FROM   calls_nov
    WHERE  endtime >= _from
    AND    starttime <= (SELECT max(endtime) FROM x);

CREATE INDEX y_idx ON y (starttime, endtime);

RETURN QUERY
SELECT x.sid, count(*)::int -- AS ct
FROM   x
LEFT   JOIN y ON x.starttime <= y.endtime AND x.endtime >= y.starttime
GROUP  BY 1
ORDER  BY 2 DESC;

END;
$BODY$   LANGUAGE plpgsql;

Call:

SELECT * FROM f_call_overlaps('2011-11-02'); -- just name your date

Total runtime: 138.169 ms -- that's factor 3000


What else can you do to speed it up?

General performance optimization.

CLUSTER calls_nov USING starttime_index; -- this also vacuums the table fully

ANALYZE calls_nov;
– Erwin Brandstetter
  • Thanks. The explain plan is still not inspiring: Sort (cost=4785158982.43..4785158982.93 rows=200 width=32). However, it appears to be 10x better than the previous ones. Running the query now, hopefully it will return. – Sologoub Jan 06 '12 at 02:14
  • @Sologoub: I added quite a bit more to my answer. – Erwin Brandstetter Jan 06 '12 at 02:57
  • Ok... just WOW! Thank you for all the info. Working on absorbing the knowledge :) – Sologoub Jan 06 '12 at 03:11
  • well, got an error: ERROR: structure of query does not match function result type SQL state: 42804 Detail: Returned type character varying does not match expected type integer in column 1. Context: PL/pgSQL function "f_call_overlaps" line 23 at RETURN QUERY – Sologoub Jan 06 '12 at 03:44
  • modifying to this seems to allow the function to run: RETURNS TABLE (sid varchar, ct bigint) – Sologoub Jan 06 '12 at 03:49
  • @Sologoub: Right, your table definition said `varchar`. Fixed my answer, too. I am very curious if you get a similar speedup? BTW, Adjusting `temp_buffers` is probably not necessary, if your db config has sane parameters. The temp tables should not be that huge. (But testing >> guessing) – Erwin Brandstetter Jan 06 '12 at 03:53
  • I'll be ecstatic if this thing just returns results this side of the century :) All my previous attempts ran for over 10 hours, after which I'd kill them and start over. – Sologoub Jan 06 '12 at 03:57
  • @Sologoub: You'd better test with a small time-slice first, like `starttime >= '2011-11-02 0:0' AND starttime < '2011-11-02 1:0'`. – Erwin Brandstetter Jan 06 '12 at 04:06
  • Performance-wise this worked great - query returned in 1,520,759ms. However, the resulting data is puzzling - according to it, I have 37k concurrent phone calls. That's not possible. I need to go through and figure this out, but your answer definitely helped with performance. Thanks again! – Sologoub Jan 06 '12 at 04:59
  • I finally have enough reputation to comment. This answer tells you how many other calls began while the current call was in progress. For a call lasting a long time, this query will return insane values. A very short call will do the same. While your question is ambiguous, I am certain that this is not the answer you were seeking. I believe that you wanted to know the maximum number of active calls at any given time. See my answer below for the solution if that's the case. – Pan Jan 10 '20 at 21:50

Here's what the possible overlaps look like, where 'A' is the "reference" interval. Note that the query below (far, far below) doesn't give the same result as any of the answers yet posted.

-- A            |------|
-- B |-|
-- C        |---|
-- D          |---|
-- E             |---|
-- F               |---|
-- G                 |---|
-- H                   |---|
-- I                       |---|

"B" doesn't overlap "A" at all. "C" abuts it. {"D", "E", "F", "G"} overlaps it. "H" abuts it. "I" doesn't overlap it at all.

create table calls_nov (
  sid varchar(5) primary key,
  starttime timestamp not null,
  endtime timestamp not null
);  

insert into calls_nov values
('A', '2012-01-04 08:00:00', '2012-01-04 08:00:10'),
('B', '2012-01-04 07:50:00', '2012-01-04 07:50:03'),
('C', '2012-01-04 07:59:57', '2012-01-04 08:00:00'),
('D', '2012-01-04 07:59:57', '2012-01-04 08:00:03'),
('E', '2012-01-04 08:00:01', '2012-01-04 08:00:04'),
('F', '2012-01-04 08:00:07', '2012-01-04 08:00:10'),
('G', '2012-01-04 08:00:07', '2012-01-04 08:00:13'),
('H', '2012-01-04 08:00:10', '2012-01-04 08:00:13'),
('I', '2012-01-04 08:00:15', '2012-01-04 08:00:18');

You can see all the overlapping intervals like this. (I just used to_char() to make it easy to see all the data. You can omit it in production.)

select t1.sid, to_char(t1.starttime, 'HH12:MI:SS'), 
               to_char(t1.endtime,   'HH12:MI:SS'), 
       t2.sid, to_char(t2.starttime, 'HH12:MI:SS'), 
               to_char(t2.endtime,   'HH12:MI:SS')
from calls_nov t1
inner join calls_nov t2 on (t2.starttime, t2.endtime) 
                  overlaps (t1.starttime, t1.endtime) 
order by t1.sid, t2.sid;

A   08:00:00   08:00:10   A   08:00:00   08:00:10
A   08:00:00   08:00:10   D   07:59:57   08:00:03
A   08:00:00   08:00:10   E   08:00:01   08:00:04
A   08:00:00   08:00:10   F   08:00:07   08:00:10
A   08:00:00   08:00:10   G   08:00:07   08:00:13
B   07:50:00   07:50:03   B   07:50:00   07:50:03
C   07:59:57   08:00:00   C   07:59:57   08:00:00
C   07:59:57   08:00:00   D   07:59:57   08:00:03
D   07:59:57   08:00:03   A   08:00:00   08:00:10
D   07:59:57   08:00:03   C   07:59:57   08:00:00
D   07:59:57   08:00:03   D   07:59:57   08:00:03
D   07:59:57   08:00:03   E   08:00:01   08:00:04
E   08:00:01   08:00:04   A   08:00:00   08:00:10
E   08:00:01   08:00:04   D   07:59:57   08:00:03
E   08:00:01   08:00:04   E   08:00:01   08:00:04
F   08:00:07   08:00:10   A   08:00:00   08:00:10
F   08:00:07   08:00:10   F   08:00:07   08:00:10
F   08:00:07   08:00:10   G   08:00:07   08:00:13
G   08:00:07   08:00:13   A   08:00:00   08:00:10
G   08:00:07   08:00:13   F   08:00:07   08:00:10
G   08:00:07   08:00:13   G   08:00:07   08:00:13
G   08:00:07   08:00:13   H   08:00:10   08:00:13
H   08:00:10   08:00:13   G   08:00:07   08:00:13
H   08:00:10   08:00:13   H   08:00:10   08:00:13
I   08:00:15   08:00:18   I   08:00:15   08:00:18

You can see from this table that "A" should count 5, including itself. "B" should count 1; it overlaps itself, but no other intervals overlap it. That seems the right thing to do.

Counting is straightforward, but runs like a ruptured turtle. That's because evaluating an overlap takes a lot of work.

select t1.sid, count(t2.sid) as num_concurrent
from calls_nov t1
inner join calls_nov t2 on (t2.starttime, t2.endtime) 
                  overlaps (t1.starttime, t1.endtime) 
group by t1.sid
order by num_concurrent desc;

A   5
D   4
G   4
E   3
F   3
H   2
C   2
I   1
B   1

To get better performance, you can use the "table" above in a common table expression, and count based on that.

with interval_table as (
select t1.sid as sid_1, t1.starttime as starttime_1, t1.endtime as endtime_1,
       t2.sid as sid_2, t2.starttime as starttime_2, t2.endtime as endtime_2
from calls_nov t1
inner join calls_nov t2 on (t2.starttime, t2.endtime) 
                  overlaps (t1.starttime, t1.endtime) 
) 
select sid_1, count(sid_2) as num_concurrent
from interval_table
group by sid_1
order by num_concurrent desc;
– Mike Sherrill 'Cat Recall'
  • thank you for the very informative answer! However, when I ran the explain plan, the query with the table expression is orders of magnitude worse: "Sort (cost=2566228269298.11..2566228269298.61 rows=200 width=64)" vs "Sort (cost=11294858654.81..11294859096.47 rows=176663 width=35)" for @Eric's answer. Could this be a case of a missing index? – Sologoub Jan 05 '12 at 16:57
  • I have indexes on starttime and endtime. The CTE is lots faster here, but I don't have 2.9 million rows. – Mike Sherrill 'Cat Recall' Jan 05 '12 at 17:07
  • Yeah, I've indexed on that as well, but maybe it's just my local box is not strong enough for this. – Sologoub Jan 05 '12 at 20:50

I'm assuming that you want to know the number of active calls at any given time. The other answers tell you how many other calls were active while the current call was active; for very long calls, that can produce very high numbers. One of your comments on the other answers indicated that the number of active calls is what you want (additionally, I also work in telecom). Unfortunately, I don't have enough reputation to comment on that answer yet, as I created my account to answer this question.

To get the number of active calls, you can use a variable that increases by one when a call starts and decreases by one when it ends. I have tested this on a MySQL database with 50+ million calls. Sorry about any syntax differences between MySQL and pgsql.

I added temporary tables for speed, but with only 2m rows and indexes, they may not be needed. MySQL cannot reference the same temporary table twice, so I had to create two.

CREATE TEMPORARY TABLE a
SELECT sid, StartTime, EndTime 
FROM calls_nov
WHERE StartTime between '2011-11-02' and '2011-11-03';

CREATE TEMPORARY TABLE b
SELECT *
FROM a;

SET @i := 0;

SELECT *, @i := @i + c.delta AS concurrent
FROM (
  SELECT StartTime AS time, 1 AS delta
  FROM a
  UNION ALL
  SELECT EndTime AS time, -1 AS delta
  FROM b
  ORDER BY time
) AS c
ORDER BY concurrent DESC
;

The inner SELECT returns two columns. The time column includes each StartTime and each EndTime from the original table (twice the number of rows), and the delta column is +1 or -1 depending on which column was put into 'time'. This set is ordered by time, and the outer SELECT then iterates through it, keeping a running total.

Instead of "ORDER BY concurrent DESC" as you had in your query, I would use an additional outer SELECT where I could get MAX, MIN etc. values and I could also GROUP BY date, hour etc. This part of the query (ORDER BY concurrent DESC), I actually did not test. I used my own suggestion with an additional outer query, as ORDER BY does not perform as expected in MySQL when ordering by a variable that was set in the same SELECT. It orders by the previous value of the variable instead. If you absolutely need to order by concurrent calls (and pgsql has the same problem), I believe that you could get around this by again using an additional outer SELECT and ordering there.

The query I ran was very fast! It scans through each temporary table once, then through the combination of the two once (with less data per row), and in my own version with the additional outer query it scans through the combination once more and then groups it. Each table is only scanned once! This will all be done in RAM if your configuration and hardware allow it. Other answers (or questions) will help you if they do not.

– Pan

Try this in lieu of your BETWEEN and the cross join:

select
    t1.sid,
    count(1) as CountSimultaneous
from
   calls_nov t1
   inner join calls_nov t2 on
       t1.starttime <= t2.endtime
       and t1.endtime >= t2.starttime
where
    t1.starttime between '2011-11-02' and '2011-11-03'
group by
    t1.sid
order by CountSimultaneous desc
– Eric
  • This is close, but needs `and t1.sid != t2.sid` to ensure the same row doesn't get joined to itself – rejj Jan 04 '12 at 20:50
  • @rejj - I thought about that, but I didn't put it in. In reality, you could put that condition in, make it a `left join`, and `count(t2.sid)`, and it would just give you 1 less for each number. Or you could do `count(1)-1`. Probably a wash either way. – Eric Jan 04 '12 at 20:53
  • it should be ok without it, as I'm looking for the number of concurrent calls. Counting self is ok. – Sologoub Jan 04 '12 at 20:53
  • The query cost is a little better, but overall it's still in the billions... Cost for a result set with 1 row is 91k. Not really sure what's going on here. – Sologoub Jan 04 '12 at 20:57
  • @Eric - Your join condition is wrong, you are repeating the same condition – Lamak Jan 04 '12 at 20:59
  • @Sologoub - Try that--I erred in my join conditions. Lamak, thanks for pointing it out. – Eric Jan 04 '12 at 21:01