How to improve performance of a JOIN of two SCD2 tables in Oracle SQL

Question

I have two tables, both using valid to and valid from logic. Table 1 looks like this:

ID | VALID_FROM | VALID_TO 
1  | 01.01.2000 | 04.01.2000
1  | 04.01.2000 | 16.01.2000
1  | 16.01.2000 | 17.01.2000
1  | 17.01.2000 | 19.01.2000
2  | 03.02.2001 | 04.04.2001
2  | 04.04.2001 | 14.03.2001
2  | 14.04.2001 | 18.03.2001

while table 2 looks like this:

ID | VAR | VALID_FROM | VALID_TO 
1  |  3  | 01.01.2000 | 17.01.2000
1  |  2  | 17.01.2000 | 19.01.2000
2  |  4  | 03.02.2001 | 14.03.2001

Table 1 has 132,195,791 rows and table 2 has 16,964,846.
The valid from and valid to date of any observation in table 1 is within one or more valid from to valid to windows shown in table 2.
I created primary keys for both of them over ID and VALID_FROM
I want to do an inner join like:

    select t1.*, 
           t2.var 
      from t1 t1
inner join t2 t2
        on t1.id = t2.id
       and t1.valid_from >= t2.valid_from
       and t1.valid_to <= t2.valid_to;

This join is really slow. I ran it half a day without any success. What can I do to increase performance in this particular case? Please note that I also want to left join the resulting table in later stages. Any help is highly appreciated.

EDIT

Obviously, the information I gave was less then generally desired here on the platform.

I use Oracle Database 12c Enterprise Edition
The example I gave was illustrative for the bigger problem at hand. I am concerned with joining information from different tables with different valid_from / valid_to dates. For this I created a grid first with the distinct values in the valid_from and valid_to variables of all the relevant tables. This grid is what I refer here to as table 1.
Results from the execution plan (I adjusted the column and table names to meet the terminology used in my illustrative example):

    --------------------------------------------------------------------------------------
    | Id  | Operation          | Name    | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
    --------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT   |         |   465M|    23G|       |   435K  (3)| 00:00:18 |
    |*  1 |  HASH JOIN         |         |   465M|    23G|   695M|   435K  (3)| 00:00:18 |
    |   2 |   TABLE ACCESS FULL| TABLE2 |    16M|   501M|       | 22961   (2)| 00:00:01 |
    |   3 |   TABLE ACCESS FULL| TABLE1 |   132M|  3025M|       |   145K  (2)| 00:00:06 |
    --------------------------------------------------------------------------------------

    Query Block Name / Object Alias (identified by operation id):
    -------------------------------------------------------------

       1 - SEL$58A6D7F6
       2 - SEL$58A6D7F6 / T2@SEL$1
       3 - SEL$58A6D7F6 / T1@SEL$1

    Predicate Information (identified by operation id):
    ---------------------------------------------------

       1 - access("T1"."ID"="T2"."ID")
           filter("T1"."VALID_TO"<="T2"."VALID_TO" AND 
                  "T1"."VALID_FROM">="T2"."VALID_FROM")

    Column Projection Information (identified by operation id):
    -----------------------------------------------------------

       1 - (#keys=1) "T2"."ID"[VARCHAR2,20], 
           "T1"."ID"[VARCHAR2,20], "T1"."VALID_TO"[DATE,7], 
           "T2"."VAR"[VARCHAR2,20], "T2"."VALID_FROM"[DATE,7], 
           "T2"."VALID_TO"[DATE,7], "T1"."ID"[VARCHAR2,20], 
           "T1"."VALID_FROM"[DATE,7], "T1"."VALID_TO"[DATE,7], "T1"."VALID_FROM"[DATE,7]
       2 - "T2"."ID"[VARCHAR2,20], 
           "T2"."VAR"[VARCHAR2,20], "T2"."VALID_FROM"[DATE,7], 
           "T2"."VALID_TO"[DATE,7]
       3 - "T1"."ID"[VARCHAR2,20], "T1"."VALID_FROM"[DATE,7], 
           "T1"."VALID_TO"[DATE,7]

    Note
    -----
       - this is an adaptive plan

What indexes are on the tables? I'd try a compound one on `id`, `valid_from` and `valid_to` on both of the tables each. Maybe also include `var` as last column in the index for `t2` then it possibly can run on just the indexes. — sticky bit, May 17 '19 at 23:01
Have you considered to modify the relationship model by adding a `VAR` column to Table1 so that Table1 may have an FK to Table2? — Little Santi, May 17 '19 at 23:33
See [here](https://stackoverflow.com/a/34975420/4808122) what *minimum information* you should provide. Generall advice: *check your object statistics* (verify if the figures you posted is the same as the impression of Oracle. For your query: **get rid of all indexes** and check if **HASH JOIN** is used. — Marmite Bomber, May 18 '19 at 05:41
Voted up because this model is often chosen for flexibility despite being notoriously hard to query efficiently. However @MarmiteBomber is right, more details would really help. — William Robertson, May 18 '19 at 10:36
What exactly the result of this query is needed for ? Since this query did not end for half a day and it is not really known how long it will take (2 days ? A week or a month?), so maybe it is better to simply skip this step and focus on the optmalization of the next stage ? it looks unrealistic to speed this query up from half hours a day (or more) to only an hour or two. — krokodilko, May 18 '19 at 11:54
@krokodilko IMO the oposite is true. To get a query from **days** to **hours** is in most cases easy because it is coused by trivial reasons (table is locked, never ending nested loops, excesive calls of a subquery,..). The hard job is to get the query from **hours** to **minutes**;) — Marmite Bomber, May 19 '19 at 10:15
I'd guess the problem is related to the temporary tablespace required to hash join the two tables. But it's better if we don't have to guess. Next time you run the query, generate a SQL monitor report and post the results here: `select dbms_sqltune.report_sql_monitor(sql_id => 'your SQL_ID') from dual;` That will tell us what operations in the execution plan are slow, and what resource they are waiting on. Since this statement must join many rows before filtering them it may need a huge amount of temporary tablespace, and may be waiting for more space to be added. — Jon Heller, May 21 '19 at 04:51
@stickybit: There were no indexes on the tables at all. I created the indexes as suggested and this way my query took only about 2 hours. Thanks! — Roberto Liebscher, May 21 '19 at 06:40
@krokodilko: The example I gave was illustrative for the bigger problem at hand. I am concerned with joining information from different tables with different valid_from / valid_to dates. For this I created a grid first with the distinct values in the valid_from and valid_to variables of all the relevant tables. This grid is what I refer here to as table 1. Then after the mentioned inner join I left join this table with all kind of different tables. — Roberto Liebscher, May 21 '19 at 06:56
@JonHeller: I am pretty new to this. Can you tell me how I find the SQL ID? I tried select sql_id from v$sql_monitor without success (table or view not available). — Roberto Liebscher, May 21 '19 at 07:08
@RobertoLiebscher You can find the SQL_ID in `GV$SQL` with a statement like this: `select sql_id, sql_text from gv$sql where lower(sql_fulltext) like '%t2.var%';` There are a lot of statements in there (including the SQL used to find the SQL), so you may have to do some additional filtering. — Jon Heller, May 21 '19 at 19:36
@Roberto Liebscher I understand, You "created a grid ..." etc. in a hope that this step will speed up the following steps. But this did not work - **this very first step** builds - according to the plan a huge hash table of the size 23GB !!!! - see this: `* 1 | HASH JOIN | | 465M| ***23G***|`. Unless you have at least 25~30GB RAM (in order to fit all this monster hash table in memory), this query must last for days or weeks, swapping this hash table from memory to disk. My intent is - don't do this step at all, skip it and concentrate on optimising the following steps. — krokodilko, May 21 '19 at 21:49

score 1 · Answer 1 · answered May 19 '19 at 10:11

A good practice is to ask first: what is expected the query will return?

Base on your WHERE predicate is seems you are interested on all versions from table2 that are included in the validity interval of table1. This may be intention, but more common you need all versions that intersect between the tables.

The second aspect is, do you need to see few first rows or all rows from the join.

If you only want to see few results, simple add AND t1.ID = nnnn to the WHERE clause to limit to some sample ID. If you have proper indexes (and tehre are no expreme lot of rows with this ID), you will get the result quick as NESTED LOOP join will kick in.

To perform the the full result, you must consider all rows from both tables. No index will help you to select all rows from a table - here is the FULL TABLE SCAN the best option.

To join the large row sets the best approach is HASH JOIN. NESTED LOOPS (which you probably use now) are quick to join few rows, but hangs on large row sets.

The smaller table (table2) is red in memory (hopefully) as a hash table. The larger table (table1) is probed against this hash table toperform the join.

This is the execution plan you should look for

-----------------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    10T|   399T|       |   190M(100)| 02:03:47 |
|*  1 |  HASH JOIN         |      |    10T|   399T|   550M|   190M(100)| 02:03:47 |
|   2 |   TABLE ACCESS FULL| SCD2 |    16M|   355M|       |    39  (93)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| SCD1 |   132M|  2395M|       |   211  (99)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("T1"."ID"="T2"."ID")
       filter("T1"."VALID_FROM">="T2"."VALID_FROM" AND 
              "T1"."VALID_TO"<="T2"."VALID_TO")

Provided you are on an enterprise database this should pass you from days to hours. Further you can deploy parallel option to get additional speed up.

Good luck!

How to improve performance of a JOIN of two SCD2 tables in Oracle SQL

1 Answers1