I can confirm that the issue in question is still a problem on Oracle 12.1.0.2.0.
And even hardcoded partition elimination bounds are not enough.
Here is the test table in my case:
CREATE TABLE FR_MESSAGE_PART (
ID NUMBER(38) NOT NULL CONSTRAINT PK_FR_MESSAGE_PART PRIMARY KEY USING INDEX LOCAL,
TRX_ID NUMBER(38) NOT NULL, TS TIMESTAMP NOT NULL, TEXT CLOB)
PARTITION BY RANGE (ID) (PARTITION PART_0 VALUES LESS THAN (0));
CREATE INDEX IX_FR_MESSAGE_PART_TRX_ID ON FR_MESSAGE_PART(TRX_ID) LOCAL;
CREATE INDEX IX_FR_MESSAGE_PART_TS ON FR_MESSAGE_PART(TS) LOCAL;
The table is populated with several millions of records of OLTP production data for several months. Each month belongs to a separate partition.
Primary key values of this table always include time part in higher bits that allows to use ID
for range partitioning by calendar periods. All messages inherit higher time bits of TRX_ID
. This ensures that all messages belonging to the same business operation do always fall in the same partition.
Let's start with hardcoded query for selecting a page of the most recent messages for a given time period with partition elimination bounds applied:
select * from (select * from FR_MESSAGE_PART
where TS >= DATE '2017-11-30' and TS < DATE '2017-12-02'
and ID >= 376894993815568384 and ID < 411234940974268416
order by TS DESC) where ROWNUM <= 40;
But, having freshly gathered table statistics, Oracle optimizer still falsely estimates that sorting two entire monthly partitions would be faster than a range scan for two days by existing local index:
-----------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 40 | 26200 | | 103K (1)| 00:00:05 | | |
|* 1 | COUNT STOPKEY | | | | | | | | |
| 2 | VIEW | | 803K| 501M| | 103K (1)| 00:00:05 | | |
|* 3 | SORT ORDER BY STOPKEY | | 803K| 70M| 92M| 103K (1)| 00:00:05 | | |
| 4 | PARTITION RANGE ITERATOR| | 803K| 70M| | 86382 (1)| 00:00:04 | 2 | 3 |
|* 5 | TABLE ACCESS FULL | FR_MESSAGE_PART | 803K| 70M| | 86382 (1)| 00:00:04 | 2 | 3 |
-----------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=40)
3 - filter(ROWNUM<=40)
5 - filter("TS"<TIMESTAMP' 2017-12-01 00:00:00' AND "TS">=TIMESTAMP' 2017-11-29 00:00:00' AND
"ID">=376894993815568384)
Actual execution time appears by an order of magnitude longer than estimated in plan.
So we have to apply a hint to force usage of the index:
select * from (select /*+ FIRST_ROWS(40) INDEX(FR_MESSAGE_PART (TS)) */ * from FR_MESSAGE_PART
where TS >= DATE '2017-11-30' and TS < DATE '2017-12-02'
and ID >= 376894993815568384 and ID < 411234940974268416
order by TS DESC) where ROWNUM <= 40;
Now the plan uses the index but still envolves slow sorting of two entire partitions:
-----------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 40 | 26200 | | 615K (1)| 00:00:25 | | |
|* 1 | COUNT STOPKEY | | | | | | | | |
| 2 | VIEW | | 803K| 501M| | 615K (1)| 00:00:25 | | |
|* 3 | SORT ORDER BY STOPKEY | | 803K| 70M| 92M| 615K (1)| 00:00:25 | | |
| 4 | PARTITION RANGE ITERATOR | | 803K| 70M| | 598K (1)| 00:00:24 | 2 | 3 |
|* 5 | TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| FR_MESSAGE_PART | 803K| 70M| | 598K (1)| 00:00:24 | 2 | 3 |
|* 6 | INDEX RANGE SCAN | IX_FR_MESSAGE_PART_TS | 576K| | | 2269 (1)| 00:00:01 | 2 | 3 |
-----------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=40)
3 - filter(ROWNUM<=40)
5 - filter("ID">=376894993815568384)
6 - access("TS">=TIMESTAMP' 2017-11-30 00:00:00' AND "TS"<TIMESTAMP' 2017-12-02 00:00:00')
After some struggling through Oracle hints reference and google it was found that we also have to explicitly specify the descending direction for index range scan with INDEX_DESC or INDEX_RS_DESC hint:
select * from (select /*+ FIRST_ROWS(40) INDEX_RS_DESC(FR_MESSAGE_PART (TS)) */ * from FR_MESSAGE_PART
where TS >= DATE '2017-11-30' and TS < DATE '2017-12-02'
and ID >= 376894993815568384 and ID < 411234940974268416
order by TS DESC) where ROWNUM <= 40;
This at last gives fast plan with COUNT STOPKEY
per partition which scans partitions in descending order and sorts only at most 40 rows from each partition:
------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 40 | 26200 | | 615K (1)| 00:00:25 | | |
|* 1 | COUNT STOPKEY | | | | | | | | |
| 2 | VIEW | | 803K| 501M| | 615K (1)| 00:00:25 | | |
|* 3 | SORT ORDER BY STOPKEY | | 803K| 70M| 92M| 615K (1)| 00:00:25 | | |
| 4 | PARTITION RANGE ITERATOR | | 803K| 70M| | 598K (1)| 00:00:24 | 3 | 2 |
|* 5 | COUNT STOPKEY | | | | | | | | |
|* 6 | TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| FR_MESSAGE_PART | 803K| 70M| | 598K (1)| 00:00:24 | 3 | 2 |
|* 7 | INDEX RANGE SCAN DESCENDING | IX_FR_MESSAGE_PART_TS | 576K| | | 2269 (1)| 00:00:01 | 3 | 2 |
------------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=40)
3 - filter(ROWNUM<=40)
5 - filter(ROWNUM<=40)
6 - filter("ID">=376894993815568384)
7 - access("TS">=TIMESTAMP' 2017-11-30 00:00:00' AND "TS"<TIMESTAMP' 2017-12-02 00:00:00')
filter("TS">=TIMESTAMP' 2017-11-30 00:00:00' AND "TS"<TIMESTAMP' 2017-12-02 00:00:00')
This runs blazing fast but estimated plan cost is still falsely too high.
So far so good. Now let's try to make the query parametrized to be used in our custom ORM framework:
select * from (select /*+ FIRST_ROWS(40) INDEX_RS_DESC(FR_MESSAGE_PART (TS)) */ * from FR_MESSAGE_PART
where TS >= :1 and TS < :2
and ID >= :3 and ID < :4
order by TS DESC) where ROWNUM <= 40;
But then COUNT STOPKEY
per partition disappears from the plan as stated in the question and confirmed in the other answer:
----------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 40 | 26200 | 82349 (1)| 00:00:04 | | |
|* 1 | COUNT STOPKEY | | | | | | | |
| 2 | VIEW | | 153 | 97K| 82349 (1)| 00:00:04 | | |
|* 3 | SORT ORDER BY STOPKEY | | 153 | 14076 | 82349 (1)| 00:00:04 | | |
|* 4 | FILTER | | | | | | | |
| 5 | PARTITION RANGE ITERATOR | | 153 | 14076 | 82348 (1)| 00:00:04 | KEY | KEY |
|* 6 | TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| FR_MESSAGE_PART | 153 | 14076 | 82348 (1)| 00:00:04 | KEY | KEY |
|* 7 | INDEX RANGE SCAN DESCENDING | IX_FR_MESSAGE_PART_TS | 110K| | 450 (1)| 00:00:01 | KEY | KEY |
----------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=40)
3 - filter(ROWNUM<=40)
4 - filter(TO_NUMBER(:4)>TO_NUMBER(:3) AND TO_TIMESTAMP(:2)>TO_TIMESTAMP(:1))
6 - filter("ID">=TO_NUMBER(:3) AND "ID"<TO_NUMBER(:4))
7 - access("TS">=TO_TIMESTAMP(:1) AND "TS"<TO_TIMESTAMP(:2))
filter("TS">=TO_TIMESTAMP(:1) AND "TS"<TO_TIMESTAMP(:2))
Then I tried to retreat to hardcoded monthly-aligned partition elimination bounds but still retain parametrized timestamp bounds to minimize plan cache spoiling.
select * from (select /*+ FIRST_ROWS(40) INDEX_RS_DESC(FR_MESSAGE_PART (TS)) */ * from FR_MESSAGE_PART
where TS >= :1 and TS < :2
and ID >= 376894993815568384 and ID < 411234940974268416
order by TS DESC) where ROWNUM <= 40;
But still got slow plan:
------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 40 | 26200 | | 83512 (1)| 00:00:04 | | |
|* 1 | COUNT STOPKEY | | | | | | | | |
| 2 | VIEW | | 61238 | 38M| | 83512 (1)| 00:00:04 | | |
|* 3 | SORT ORDER BY STOPKEY | | 61238 | 5501K| 7216K| 83512 (1)| 00:00:04 | | |
|* 4 | FILTER | | | | | | | | |
| 5 | PARTITION RANGE ITERATOR | | 61238 | 5501K| | 82214 (1)| 00:00:04 | 3 | 2 |
|* 6 | TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| FR_MESSAGE_PART | 61238 | 5501K| | 82214 (1)| 00:00:04 | 3 | 2 |
|* 7 | INDEX RANGE SCAN DESCENDING | IX_FR_MESSAGE_PART_TS | 79076 | | | 316 (1)| 00:00:01 | 3 | 2 |
------------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=40)
3 - filter(ROWNUM<=40)
4 - filter(TO_TIMESTAMP(:2)>TO_TIMESTAMP(:1))
6 - filter("ID">=376894993815568384)
7 - access("TS">=TO_TIMESTAMP(:1) AND "TS"<TO_TIMESTAMP(:2))
filter("TS">=TO_TIMESTAMP(:1) AND "TS"<TO_TIMESTAMP(:2))
@ChrisSaxon in his answer here has mentioned that missing nested STOPKEY COUNT
has something to do with filter(TO_TIMESTAMP(:2)>TO_TIMESTAMP(:1))
operation which validates that the upper bound is really bigger than the lower one.
Taking this into account I tried to cheat the oprimizer by transforming TS between :a and :b
into equivalent :b between TS and TS + (:b - :a)
. And this worked!
After some additional investigation of the root cause of this change, I've found that just replacing TS >= :1 and TS < :2
with TS + 0 >= :1 and TS < :2
helps to achieve optimal execution plan.
select * from (select /*+ FIRST_ROWS(40) INDEX_RS_DESC(FR_MESSAGE_PART (TS)) */ * from FR_MESSAGE_PART
where TS + 0 >= :1 and TS < :2
and ID >= 376894993815568384 and ID < 411234940974268416
order by TS DESC) where ROWNUM <= 40;
The plan now has proper COUNT STOPKEY
per partition and a notion of INTERNAL_FUNCTION("TS")+0
which prevented the toxic extra bounds checking filter, I guess.
------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 40 | 26200 | | 10120 (1)| 00:00:01 | | |
|* 1 | COUNT STOPKEY | | | | | | | | |
| 2 | VIEW | | 61238 | 38M| | 10120 (1)| 00:00:01 | | |
|* 3 | SORT ORDER BY STOPKEY | | 61238 | 5501K| 7216K| 10120 (1)| 00:00:01 | | |
| 4 | PARTITION RANGE ITERATOR | | 61238 | 5501K| | 8822 (1)| 00:00:01 | 3 | 2 |
|* 5 | COUNT STOPKEY | | | | | | | | |
|* 6 | TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| FR_MESSAGE_PART | 61238 | 5501K| | 8822 (1)| 00:00:01 | 3 | 2 |
|* 7 | INDEX RANGE SCAN DESCENDING | IX_FR_MESSAGE_PART_TS | 7908 | | | 631 (1)| 00:00:01 | 3 | 2 |
------------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=40)
3 - filter(ROWNUM<=40)
5 - filter(ROWNUM<=40)
6 - filter("ID">=376894993815568384)
7 - access("TS"<TO_TIMESTAMP(:2))
filter(INTERNAL_FUNCTION("TS")+0>=:1 AND "TS"<TO_TIMESTAMP(:2))
We had to implement the mentioned Oracle-specific + 0
workaround and partition elimination bounds hardcoding in our custom ORM framework. It allows to retain the same fast paging performance after switching to partitioned tables with local indices.
But I wish much patience and sanity to those who venture to do the same switch without complete control of sql-building code.
It appears Oracle has too much pitfalls when partitioning and paging are mixed together. For example, we found that Oracle 12's new OFFSET ROWS / FETCH NEXT ROWS ONLY
syntax sugar is almost unusable with local indexed partitioned tables as most of analytic windowing functions it's based upon.
The shortest working query to fetch some page behind the first one is
select * from (select * from (
select /*+ FIRST_ROWS(200) INDEX_RS_DESC(FR_MESSAGE_PART (TS)) */* from FR_MESSAGE_PART
where TS + 0 >= :1 and TS < :2
and ID >= 376894993815568384 and ID < 411234940974268416
order by TS DESC) where ROWNUM <= 200) offset 180 rows;
Here is an example of actual execution plan after running such query:
SQL_ID c67mmq4wg49sx, child number 0
-------------------------------------
select * from (select * from (select /*+ FIRST_ROWS(200)
INDEX_RS_DESC("FR_MESSAGE_PART" ("TS")) GATHER_PLAN_STATISTICS */ "ID",
"MESSAGE_TYPE_ID", "TS", "REMOTE_ADDRESS", "TRX_ID",
"PROTOCOL_MESSAGE_ID", "MESSAGE_DATA_ID", "TEXT_OFFSET", "TEXT_SIZE",
"BODY_OFFSET", "BODY_SIZE", "INCOMING" from "FR_MESSAGE_PART" where
"TS" + 0 >= :1 and "TS" < :2 and "ID" >= 376894993815568384 and "ID" <
411234940974268416 order by "TS" DESC) where ROWNUM <= 200) offset 180
rows
Plan hash value: 2499404919
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| E-Time | Pstart| Pstop | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | | 640K(100)| | | | 20 |00:00:00.01 | 322 | | | |
|* 1 | VIEW | | 1 | 200 | 130K| | 640K (1)| 00:00:26 | | | 20 |00:00:00.01 | 322 | | | |
| 2 | WINDOW NOSORT | | 1 | 200 | 127K| | 640K (1)| 00:00:26 | | | 200 |00:00:00.01 | 322 | 142K| 142K| |
| 3 | VIEW | | 1 | 200 | 127K| | 640K (1)| 00:00:26 | | | 200 |00:00:00.01 | 322 | | | |
|* 4 | COUNT STOPKEY | | 1 | | | | | | | | 200 |00:00:00.01 | 322 | | | |
| 5 | VIEW | | 1 | 780K| 487M| | 640K (1)| 00:00:26 | | | 200 |00:00:00.01 | 322 | | | |
|* 6 | SORT ORDER BY STOPKEY | | 1 | 780K| 68M| 89M| 640K (1)| 00:00:26 | | | 200 |00:00:00.01 | 322 | 29696 | 29696 |26624 (0)|
| 7 | PARTITION RANGE ITERATOR | | 1 | 780K| 68M| | 624K (1)| 00:00:25 | 3 | 2 | 400 |00:00:00.01 | 322 | | | |
|* 8 | COUNT STOPKEY | | 2 | | | | | | | | 400 |00:00:00.01 | 322 | | | |
|* 9 | TABLE ACCESS BY LOCAL INDEX ROWID| FR_MESSAGE_PART | 2 | 780K| 68M| | 624K (1)| 00:00:25 | 3 | 2 | 400 |00:00:00.01 | 322 | | | |
|* 10 | INDEX RANGE SCAN DESCENDING | IX_FR_MESSAGE_PART_TS | 2 | 559K| | | 44368 (1)| 00:00:02 | 3 | 2 | 400 |00:00:00.01 | 8 | | | |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Outline Data
-------------
/*+
BEGIN_OUTLINE_DATA
IGNORE_OPTIM_EMBEDDED_HINTS
OPTIMIZER_FEATURES_ENABLE('12.1.0.2')
DB_VERSION('12.1.0.2')
OPT_PARAM('optimizer_dynamic_sampling' 0)
OPT_PARAM('_optimizer_dsdir_usage_control' 0)
FIRST_ROWS(200)
OUTLINE_LEAF(@"SEL$3")
OUTLINE_LEAF(@"SEL$2")
OUTLINE_LEAF(@"SEL$1")
OUTLINE_LEAF(@"SEL$4")
NO_ACCESS(@"SEL$4" "from$_subquery$_004"@"SEL$4")
NO_ACCESS(@"SEL$1" "from$_subquery$_001"@"SEL$1")
NO_ACCESS(@"SEL$2" "from$_subquery$_002"@"SEL$2")
INDEX_RS_DESC(@"SEL$3" "FR_MESSAGE_PART"@"SEL$3" ("FR_MESSAGE_PART"."TS"))
END_OUTLINE_DATA
*/
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("from$_subquery$_004"."rowlimit_$$_rownumber">180)
4 - filter(ROWNUM<=200)
6 - filter(ROWNUM<=200)
8 - filter(ROWNUM<=200)
9 - filter("ID">=376894993815568384)
10 - access("TS"<:2)
filter((INTERNAL_FUNCTION("TS")+0>=:1 AND "TS"<:2))
Note how much actual fetched rows and time are better than optimizer estimations.
Update
Beware than even this optimal plan could fail down to slow local index full scan in case lower partition elimination bound was guessed too low that the lowest partition doesn't contain enough records to match query filters.
rleishman's Tuning "BETWEEN" Queries states:
The problem is that an index can only scan on one column with a range
predicate (<, >, LIKE, BETWEEN). So even if an index contained both
the lower_bound and upper_bound columns, the index scan will return
all of the rows matching lower_bound <= :b, and then filter the rows
that do not match upper_bound >= :b.
In the case where the sought value is somewhere in the middle, the
range scan will return half of the rows in the table in order to find
a single row. In the worst case where the most commonly sought rows
are at the top (highest values), the index scan will process almost
every row in the table for every lookup.
It means that, unfortunately, Oracle doesn't take into account the lower bound of a range scan filter until it reaches STOPKEY COUNT condition or scans the whole partition!
So we had to limit lower partition elimination bound heuristics to the same month the lower timestamp period bound falls into.
This defends against full index scans at expense of a risk of not showing some delayed transaction messages in the list.
But this can be easily resolved by extending the supplied time period if needed.
I've also tried to apply the same + 0
trick to force optimal plan with dynamic partition elimination bounds binding:
select * from (select /*+ FIRST_ROWS(40) INDEX_RS_DESC(FR_MESSAGE_PART (TS)) */ * from FR_MESSAGE_PART
where TS+0 >= :1 and TS < :2
and ID >= :3 and ID+0 < :4
order by TS DESC) where ROWNUM <= 40;
The plan then still retains proper STOPKEY COUNT
per partition but the partition elimination is lost for upper bound as may be noticed by Pstart
column of plan table:
----------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 40 | 26200 | 9083 (1)| 00:00:01 | | |
|* 1 | COUNT STOPKEY | | | | | | | |
| 2 | VIEW | | 153 | 97K| 9083 (1)| 00:00:01 | | |
|* 3 | SORT ORDER BY STOPKEY | | 153 | 14076 | 9083 (1)| 00:00:01 | | |
| 4 | PARTITION RANGE ITERATOR | | 153 | 14076 | 9082 (1)| 00:00:01 | 10 | KEY |
|* 5 | COUNT STOPKEY | | | | | | | |
|* 6 | TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| FR_MESSAGE_PART | 153 | 14076 | 9082 (1)| 00:00:01 | 10 | KEY |
|* 7 | INDEX RANGE SCAN DESCENDING | IX_FR_MESSAGE_PART_TS | 11023 | | 891 (1)| 00:00:01 | 10 | KEY |
----------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=40)
3 - filter(ROWNUM<=40)
5 - filter(ROWNUM<=40)
6 - filter("ID">=TO_NUMBER(:3) AND "ID"+0<TO_NUMBER(:4))
7 - access("TS"<TO_TIMESTAMP(:2))
filter(INTERNAL_FUNCTION("TS")+0>=:1 AND "TS"<TO_TIMESTAMP(:2))