
I have a query of this shape:

select {{AggregateFunctionsInvokations}}
from {{LargeTable}}
where {{DateIntervalOverlappingFiltering}}
  • The table will have more than 100,000,000 rows.
  • The filtering date interval (which varies for each issued query) will range from 1 day to 10 months.

My objective is that these kinds of queries take no more than 10 seconds in the worst case. How can I achieve that? Is it even possible?

Below is the SQL code of the test case I performed.

create schema tmp3;

create table tmp3.Info1(
    emissionId text not null,
    userId text not null,
    minInstant timestamp not null,  
    maxInstant timestamp not null
);

insert into tmp3.Info1(
    emissionId,
    userId,
    minInstant,
    maxInstant
)
select
    (floor(random() * (4000 - 22 + 1)) + 22)::text,
    (floor(random() * (250000 - 33 + 1)) + 33)::text,
    min(least(X.a, X.b)),
    max(greatest(X.a, X.b))
from (
    select
        '2015-07-10 20:00:00'::timestamp + random() * ('2016-08-02 16:00:00'::timestamp - '2015-07-10 20:00:00'::timestamp) as a,
        '2015-07-10 20:00:00'::timestamp + random() * ('2016-08-02 16:00:00'::timestamp - '2015-07-10 20:00:00'::timestamp) as b
    from generate_series(1,45000000)
) X
group by 1, 2;
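A quick sanity check of the generated data (an assumed helper query, not part of the original test case) confirms the row count and the overall timestamp range before benchmarking:

select
    count(*)        as numRows,
    min(minInstant) as globalMin,
    max(maxInstant) as globalMax
from tmp3.Info1;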

An index is needed to ensure unique emission+user pairs. (Note: for this test it is not strictly needed, because the INSERT above already ensures uniqueness, but it will be needed for the normal operation of my application.)

create unique index Info1_UQ001 on tmp3.Info1 using btree (
    emissionId asc,
    userId asc
);

Covering indexes intended to make the query below take as little time as possible. Which one will the Postgres planner use?

create index Info1_IX001 on tmp3.Info1 using btree (minInstant asc, maxInstant asc, userId);
create index Info1_IX002 on tmp3.Info1 using btree (minInstant asc, maxInstant desc, userId);
create index Info1_IX003 on tmp3.Info1 using btree (minInstant desc, maxInstant asc, userId);
create index Info1_IX004 on tmp3.Info1 using btree (minInstant desc, maxInstant desc, userId);
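To check which (if any) of these indexes the planner actually picks, the plan can be inspected with EXPLAIN. A minimal sketch, using one of the tested date ranges and a simplified version of the query below for brevity (plain EXPLAIN only shows the chosen plan; EXPLAIN (ANALYZE, BUFFERS) would also run the query and report real timings and I/O):

explain
select count(distinct X.userId)
from tmp3.Info1 X
where X.minInstant <= '2016-01-25'::timestamp
  and X.maxInstant >= '2015-12-10'::timestamp;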

VACUUM reclaims storage occupied by dead tuples. ANALYZE collects statistics about the contents of tables in the database [...] Subsequently, the query planner uses these statistics to help determine the most efficient execution plans for queries.

vacuum analyze tmp3.Info1;

RESULTS:

--*
--Range: 2015-12-10 ... 2016-01-25
--Execution plan: "Seq Scan" over table "tmp3.Info1".
--Total execution time: 5 minutes 16 seconds.
--*
--Range: 2015-09-10 ... 2015-09-21
--Execution plan: "Seq Scan" over table "tmp3.Info1".
--Total execution time: 2 minutes 47 seconds.

select
    min(extract(epoch from (X.maxInstant - X.minInstant))) as minSessionTime,
    round(avg(extract(epoch from (X.maxInstant - X.minInstant)))) as avgSessionTime,
    max(extract(epoch from (X.maxInstant - X.minInstant))) as maxSessionTime,
    count(distinct X.userId) as numUsers
from tmp3.Info1 X
where
    --http://stackoverflow.com/questions/325933/determine-whether-two-date-ranges-overlap/325964#325964
    --(StartA <= EndB) and (EndA >= StartB)
    --min: "2015-07-10 20:00:00"
    --max: "2016-08-02 15:59:59.624544"
    --(X.minInstant <= @EndB)
    --and
    --(X.maxInstant >= @StartB)    
    X.minInstant <= '2016-01-25'::timestamp
    and
    X.maxInstant >= '2015-12-10'::timestamp
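For reference, the same overlap test can be expressed with PostgreSQL range types, which a GiST index can serve directly. This is only a sketch of a possible direction, not something measured here (the index name is illustrative), and for ranges that select around 1/3 of the table it may still lose to a sequential scan:

create index Info1_GiST001 on tmp3.Info1 using gist (
    tsrange(minInstant, maxInstant, '[]')
);

select
    min(extract(epoch from (X.maxInstant - X.minInstant))) as minSessionTime,
    round(avg(extract(epoch from (X.maxInstant - X.minInstant)))) as avgSessionTime,
    max(extract(epoch from (X.maxInstant - X.minInstant))) as maxSessionTime,
    count(distinct X.userId) as numUsers
from tmp3.Info1 X
where
    --same overlap condition as above, written as a range intersection;
    --the expression must match the index expression exactly for the index to be usable
    tsrange(X.minInstant, X.maxInstant, '[]')
    && tsrange('2015-12-10'::timestamp, '2016-01-25'::timestamp, '[]');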
  • I would start by changing the {emissionid,userid} columns to a numeric data type; they look like low-cardinality key elements to me. Also, about constraints: are the date intervals supposed to be non-overlapping? Gap-free? BTW: there is no payload data in the table. – joop Aug 02 '16 at 10:21
  • Changing the data type of the columns is not an option: I require both columns to be of text (or varchar(x)) type. – Roger Dielrton Aug 02 '16 at 10:25
  • About the sparsity/density of the data: the row intervals do overlap with one another (most intervals overlap some other interval, and a minority overlap none at all); moreover, there will be gaps (date regions with no rows) as well as regions with a high density of rows. – Roger Dielrton Aug 02 '16 at 10:30
  • What do you mean by lack of payload? @joop, the Info1 table contains all the information needed for my purpose: to compile statistics about the session time of users viewing media. – Roger Dielrton Aug 02 '16 at 10:33
  • It seems that the condition selects more than 1/3 of the rows (more than 15 of 45 million). In this case using an index is inefficient. And the query also has to calculate `avg` over those 15 million rows. Let's say: a query `select sum(x) from t where id between 1 and 30000000` will always be slow. There are reasons why the query is slow. Without knowing the business logic I have no suggestions on how to solve it. – Abelisto Aug 02 '16 at 10:34
  • Dear @Abelisto, my business logic, my objective, is this: to compile statistics about the session time of users viewing media. My main question is: why is the planner performing a "Seq Scan" over the table instead of using one of the four indexes? I guarantee you that index usage would improve the times meaningfully. Obviously, the avg, min, and max calculations are indispensable. – Roger Dielrton Aug 02 '16 at 10:40
  • It is a very popular question, related not only to PostgreSQL but to any other DBMS: [why postgres don't use index](http://www.google.com.ua/search?q=why+postgres+do+not+use+index). – Abelisto Aug 02 '16 at 10:52
  • Thank you, Abelisto. I checked the page http://www.postgresonline.com/journal/archives/78-Why-is-my-index-not-being-used.html, ran "vacuum analyze tmp3.Info1", and tried the query again, but the time did not improve. The ~1/3 of rows selected seems to be the reason for the seq scan. – Roger Dielrton Aug 02 '16 at 12:19
