I have a query of this shape:
select {{AggregateFunctionsInvokations}}
from {{LargeTable}}
where {{DateIntervalOverlappingFiltering}}
- The table will have more than 100,000,000 rows.
- The target filtering date interval (which varies with each issued query) will range from 1 day to 10 months.
My objective is that these kinds of queries take no more than 10 seconds, even in the worst case. How can I achieve this? Is it even possible?
Below is the SQL code for the test case I performed.
create schema tmp3;
create table tmp3.Info1(
emissionId text not null,
userId text not null,
minInstant timestamp not null,
maxInstant timestamp not null
);
insert into tmp3.Info1(
emissionId,
userId,
minInstant,
maxInstant
)
select
(floor(random() * (4000 - 22 + 1)) + 22)::text,
(floor(random() * (250000 - 33 + 1)) + 33)::text,
min(least(X.a, X.b)),
max(greatest(X.a, X.b))
from (
select
'2015-07-10 20:00:00'::timestamp + random() * ('2016-08-02 16:00:00'::timestamp - '2015-07-10 20:00:00'::timestamp) as a,
'2015-07-10 20:00:00'::timestamp + random() * ('2016-08-02 16:00:00'::timestamp - '2015-07-10 20:00:00'::timestamp) as b
from generate_series(1,45000000)
) X
group by 1, 2;
An index is needed to enforce uniqueness of the emission+user pairs. (Note: it is not strictly necessary for this test, because the INSERT statement above already guarantees uniqueness, but it will be needed for normal operation of my application.)
create unique index Info1_UQ001 on tmp3.Info1 using btree (
emissionId asc,
userId asc
);
Covering indexes intended to make the query below run as fast as possible. Which one will the Postgres planner use?
create index Info1_IX001 on tmp3.Info1 using btree (minInstant asc, maxInstant asc, userId);
create index Info1_IX002 on tmp3.Info1 using btree (minInstant asc, maxInstant desc, userId);
create index Info1_IX003 on tmp3.Info1 using btree (minInstant desc, maxInstant asc, userId);
create index Info1_IX004 on tmp3.Info1 using btree (minInstant desc, maxInstant desc, userId);
From the PostgreSQL documentation: "VACUUM reclaims storage occupied by dead tuples. ANALYZE collects statistics about the contents of tables in the database [...] Subsequently, the query planner uses these statistics to help determine the most efficient execution plans for queries."
vacuum analyze tmp3.Info1;
RESULTS:
--*
--Range: 2015-12-10...2016-01-25
--Execution plan: "Seq Scan" over the "tmp3.Info1" table.
--Total execution time: 5 minutes 16 seconds.
--*
--Range: 2015-09-10...2015-09-21
--Execution plan: "Seq Scan" over the "tmp3.Info1" table.
--Total execution time: 2 minutes 47 seconds.
select
min(extract(epoch from (X.maxInstant - X.minInstant))) as minSessionTime,
round(avg(extract(epoch from (X.maxInstant - X.minInstant)))) as avgSessionTime,
max(extract(epoch from (X.maxInstant - X.minInstant))) as maxSessionTime,
count(distinct X.userId) as numUsers
from tmp3.Info1 X
where
--http://stackoverflow.com/questions/325933/determine-whether-two-date-ranges-overlap/325964#325964
--(StartA <= EndB) and (EndA >= StartB)
--min: "2015-07-10 20:00:00"
--max: "2016-08-02 15:59:59.624544"
--(X.minInstant <= @EndB)
--and
--(X.maxInstant >= @StartB)
X.minInstant <= '2016-01-25'::timestamp
and
X.maxInstant >= '2015-12-10'::timestamp
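For what it's worth, one alternative I have read about but have not benchmarked yet is a GiST index over the session interval, so the overlap test can be expressed with the range `&&` (overlaps) operator instead of two inequalities that a B-tree index cannot serve well. This is only a sketch: the index name Info1_IX005 is made up, and the '[]' bounds assume both interval endpoints are inclusive.

```sql
-- Expression index over the closed timestamp range of each session.
create index Info1_IX005 on tmp3.Info1 using gist (
    tsrange(minInstant, maxInstant, '[]')
);

-- The WHERE clause would then become an overlap test the GiST index can serve:
-- where tsrange(X.minInstant, X.maxInstant, '[]')
--       && tsrange('2015-12-10'::timestamp, '2016-01-25'::timestamp, '[]')
```

Would this kind of index be expected to beat the B-tree variants above for interval-overlap filtering?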