Improve performance of NOT EXISTS in case of large tables

Question

What I am trying to accomplish is getting rows from one table that do not match another table based on specific filters. The two tables are relatively huge so I am trying to filter them based on a certain time range.

The steps I went through so far.

Get the IDs from "T1" for the last 3 days

SELECT
id 
FROM T1
WHERE STARTTIME BETWEEN '3 days ago' AND 'now';

Execution time is 4.5s.

Get the IDs from "T2" for the last 3 days

SELECT
id 
FROM T2
WHERE STARTTIME BETWEEN '3 days ago' AND 'now';

Execution time is 2.5s.

Now I try to use NOT EXISTS to merge the results from both statements into one

SELECT
CID
FROM T1
WHERE STARTTIME BETWEEN '3 days ago' AND 'now'
AND NOT EXISTS (
  SELECT NULL FROM T2
  WHERE T1.ID = T2.ID 
  AND STARTTIME BETWEEN '3 days ago' AND 'now'
);

Execution time is 23s.

I also tried the INNER JOIN logic from this answer thinking it makes sense, but I get no results so I cannot properly evaluate.

Is there a better way to construct this statement that could possibly lead to a faster execution time?

19.01.2022 - Update based on comments

Expected result can contain any number of rows between 1 and 10 000
The used columns have the following indexes:

CREATE INDEX IX_T1_CSTARTTIME
   ON T1 (CSTARTTIME ASC)
   TABLESPACE MYHOSTNAME_DATA1;

CREATE INDEX IX_T2_CSTARTTIME
   ON T2 (CSTARTTIME ASC)
   TABLESPACE MYHOSTNAME_DATA2;

NOTE: Just noticed that the indexes are located on different table spaces, could this be a potential issue as well?

Following the excellent comments from Marmite Bomber here is the execution plan for the statement:

---------------------------------------------------------------------------------------------
| Id  | Operation            | Name  | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |       | 21773 |  2019K|       |  1817K  (1)| 00:01:12 |
|*  1 |  HASH JOIN RIGHT ANTI|       | 21773 |  2019K|   112M|  1817K  (1)| 00:01:12 |
|*  2 |   TABLE ACCESS FULL  | T2     |  2100K|    88M|       |  1292K  (1)| 00:00:51 |
|*  3 |   TABLE ACCESS FULL  | T1     |  2177K|   105M|       |   512K  (1)| 00:00:21 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - access("T2"."ID"="T1"."ID")
2 - filter("STARTTIME">=1642336690000 AND "T2"."ID" IS NOT NULL 
           AND "STARTTIME"<=1642595934000)
3 - filter("STARTTIME">=1642336690000 AND 
           "STARTTIME"<=1642595934000)

Column Projection Information (identified by operation id):
-----------------------------------------------------------

1 - (#keys=1; rowset=256) "T1"."ID"[CHARACTER,38]
2 - (rowset=256) "T2"."ID"[CHARACTER,38]
3 - (rowset=256) "ID"[CHARACTER,38]

The problem itself may cause this longer time for the anti-join (the "not exists" condition). If each table has, say, 1 million rows, and each query selects 10,000 rows (last three days in each table), the three-days filter on each table may take a few seconds. But then, the anti-join step must compare 10,000 rows from the first table to 10,000 rows from the second table; that is 100 million comparisons, which of course will take much longer than the initial filtering. — , Jan 13 '22 at 15:48
You didn't tell us what indexes exist on the two tables (if any). Separate indexes on the ID columns and the date columns may help; the indexes on the date columns may make the initial steps faster, but the really important indexes would be those on the ID columns, because they would affect the execution of the anti-join. — , Jan 13 '22 at 15:49
By the way: rewriting your NOT EXISTS condition as a NOT IN condition or as a join (plus some condition) won't make your query faster. Oracle re-writes your condition (regardless of which syntax you use) into its own version of a join, using the most efficient approach (or what the optimizer "thinks" is the most efficient approach, anyway). That's the wrong thing to look at; your query is fine as it is right now. Look for indexes, for statistics - are they current, etc., not the structure of your query. — , Jan 13 '22 at 15:51
@Yanis Petras, You could try this one combining LEFT JOIN and WHERE clauses : SELECT T1.CID FROM T1 LEFT JOIN T2 ON T1.ID = T2.ID AND T2.STARTTIME BETWEEN '3 days ago' AND 'now' WHERE T1.STARTTIME BETWEEN '3 days ago' AND 'now' AND T2.ROWID IS NULL — Mahamoutou, Jan 13 '22 at 16:02
@Mahamoutou Thanks for the suggestion. Unfortunately, performance-wise this statement was almost the same is my initial one :/ — Yanis Petras, Jan 13 '22 at 16:20
@mathguy I just checked the indexes and we have them only on the STARTTIME columns on both tables. — Yanis Petras, Jan 13 '22 at 16:22
Like I said (and you just confirmed), writing the same query using different syntax won't help. Oracle rewrites them all, internally, to the same query it sends for execution. Regarding indexes: see if you get meaningful improvement from adding indexes on ID. (In a big organization, you would do this in the development environment first, before asking your DBA to allow you to create the same indexes in production.) — , Jan 13 '22 at 16:24
Welcome to Stack Overflow! To help you with your [tag:query-optimization] question, we need more information. Please [read this](https://stackoverflow.com/tags/query-optimization/info), then [edit] your question. — O. Jones, Jan 14 '22 at 11:14

score 0 · Answer 1 · answered Jan 13 '22 at 19:03

Is there a better way to construct this statement that could possibility lead to a faster execution time?

Your basic responsibility is to write the SQL staement, the basic responsibility of Oracle is to come with an execution plan

If you are not satified (but you should know that a combination of two sources using NOT EXISTS will take longer that the sum of the time to extract the data from the sources) your fist step should be to verify the execution plan (and not try to rewrite the statement).

See some more details how to proceede here

EXPLAIN PLAN  SET STATEMENT_ID = 'stmt1' into   plan_table  FOR
SELECT
PAD
FROM T1
WHERE STARTTIME BETWEEN date'2021-01-11' AND date'2021-01-13'
AND NOT EXISTS (
  SELECT NULL FROM T2
  WHERE T1.ID = T2.ID 
  AND STARTTIME BETWEEN date'2021-01-11' AND date'2021-01-13'
);

SELECT * FROM table(DBMS_XPLAN.DISPLAY('plan_table', 'stmt1','ALL'));

This is what you should see

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |  1999 |   150K| 10175   (1)| 00:00:01 |
|*  1 |  HASH JOIN RIGHT ANTI|      |  1999 |   150K| 10175   (1)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL  | T2   |  2002 | 26026 |  4586   (1)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL  | T1   |  4002 |   250K|  5589   (1)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - access("T1"."ID"="T2"."ID")
   2 - filter("STARTTIME"<=TO_DATE(' 2021-01-13 00:00:00', 'syyyy-mm-dd 
              hh24:mi:ss') AND "STARTTIME">=TO_DATE(' 2021-01-11 00:00:00', 
              'syyyy-mm-dd hh24:mi:ss'))
   3 - filter("STARTTIME"<=TO_DATE(' 2021-01-13 00:00:00', 'syyyy-mm-dd 
              hh24:mi:ss') AND "STARTTIME">=TO_DATE(' 2021-01-11 00:00:00', 
              'syyyy-mm-dd hh24:mi:ss'))

Note that the hash join (here anti due to the not exists) is the best way to join two large row sources. Note also that the plan does not use indexes. The reason is the same - to access large data you do not want to go over index.

Contrary to the case of low cardinality row sources (OTPL) where you expects to see index access and NESTED LOOPS ANTI.

Some times is Oracle confused (e.g. while seeing stale statistics) and decide to go the NESTED LOOPway even for large data - which leads to long elapsed time.

This should help you at least to decide if you have a problem or not.

score 0 · Answer 2 · answered Jan 16 '22 at 23:05

0

Perhaps a simple MINUS operation will accomplish what you are looking for:

    select id 
      from ( select id 
               from t1
              where starttime between '3 days ago' and 'now' 
             MINUS
             select id 
               from t2
              where starttime between '3 days ago' and 'now' 
           );

for however you actually define starttime between '3 days ago' and 'now'. This literally uses your current queries as is the MINUS operation removes from the first those values which do exist in the second and returns the result. See MINUS demo here.

answered Jan 16 '22 at 23:05

Belayer

13,578
2
11
22

Thanks for the suggestion. This one actually gave worse performance - 33s. – Yanis Petras Jan 19 '22 at 12:51
Does anyone know how you can accomplish something similar in MySQL? MySQL doesn't have the MINUS operator. A regular left join takes forever. – Emil Karlsson Nov 18 '22 at 11:00

Improve performance of NOT EXISTS in case of large tables

19.01.2022 - Update based on comments

2 Answers2