
I would like to update a column based on joins. Unfortunately, the table that I am updating and merging with itself is rather large: 87,220,021 rows.

Here is the content of the table (that I am merging with itself):

ID            ID_VERSION           VALUE_1           VALUE_2    RES_EXPEC
1                 1                   A               NULL        NULL
1                 2                   A               NULL         1
1                 2                   B               NULL        NULL
1                 3                   B               NULL         1
2                 1                   A               NULL        NULL
2                 1                   B               NULL        NULL
2                 1                   B               NULL        NULL

And here is the code that I am running:

MERGE INTO EXAMP_TAB USING(
     SELECT   ID, ID_VERSION, VALUE_1 FROM EXAMP_TAB) TAB_B
     ON (EXAMP_TAB.ID = TAB_B.ID
         AND EXAMP_TAB.ID_VERSION = (TAB_B.ID_VERSION - 1)
         AND EXAMP_TAB.VALUE_1 = TAB_B.VALUE_1)
WHEN MATCHED THEN
    UPDATE SET EXAMP_TAB.VALUE_2 = 1;

The aim of this operation is to check whether there is an entry with an identical value in VALUE_1 in the previous version of an ID. If so, VALUE_2 is updated.

Unfortunately, this operation takes hours. I have read a little about indexes, but it seems they cannot help me since they do not improve the performance of UPDATE operations.

I am also open to approaches that do not involve a MERGE statement.

Arne
  • Have you considered not storing that data in an additional column, but selecting it via a join in your query? That kind of calculated data is always hard to maintain and will cause a lot of headaches in the future... – Radagast81 Nov 13 '19 at 08:29
  • What is the expected output here? Because your query will not update any of the records for `ID = 2`. – Popeye Nov 13 '19 at 08:32
  • @Tejash yes, if there is no second version of ID = 2 with A and B, these rows will remain NULL. Which is what I wanted. – Arne Nov 13 '19 at 08:38
  • @Radagast81 Which data and how to do that :)? – Arne Nov 13 '19 at 08:39
  • `(ID, ID_VERSION) = (1,3)` -- it will have two previous rows i.e. `(1,2)` so which row needs to be compared? -- It will be better if you can share the expected output so that we can help you out. – Popeye Nov 13 '19 at 08:46
  • @Tejash only two entries have a matching entry in their previous version. Therefore, only two entries should be updated to 1 (added an expected result column to the main post). – Arne Nov 13 '19 at 09:01
  • @Arne add a row with (2,0,'B') to your sample data, which has two successors and update the expected update result - this will decide if you can use a `MERGE` statement. – Marmite Bomber Nov 14 '19 at 00:37
  • `> this operation takes hours` please post your execution plan and the number of rows you expect to be updated. – Marmite Bomber Nov 14 '19 at 00:41

4 Answers


You can get that value in your query without changing any data:

-- Your sample data:
WITH EXAMP_TAB (ID,ID_VERSION,VALUE_1) as (
select 1, 1, 'A' from dual union all
select 1, 2, 'A' from dual union all        
select 1, 2, 'B' from dual union all
select 1, 3, 'B' from dual union all
select 2, 1, 'A' from dual union all
select 2, 1, 'B' from dual union all 
select 2, 1, 'B' from dual)
-- Select starting from here:
SELECT id, id_version, value_1
     , CASE WHEN EXISTS (SELECT 1 FROM EXAMP_TAB B
                          WHERE B.ID = A.ID
                            AND B.ID_VERSION = A.ID_VERSION - 1
                            AND B.VALUE_1 = A.VALUE_1)
            THEN 1
       END value_2
  FROM EXAMP_TAB A

Result:

ID   ID_VERSION   VALUE_1   VALUE_2
1    1            A         NULL 
1    2            A         1
1    2            B         NULL 
1    3            B         1
2    1            A         NULL 
2    1            B         NULL 
2    1            B         NULL 

That solution has the benefit that you do not have to worry about keeping that data up to date.
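
If you need that value in many places, one option (just a sketch; the view name EXAMP_TAB_V is made up here) is to wrap the query in a view so VALUE_2 is computed on the fly instead of being stored:

CREATE OR REPLACE VIEW EXAMP_TAB_V AS
SELECT A.ID, A.ID_VERSION, A.VALUE_1,
       -- 1 if the previous version of this ID contains the same VALUE_1
       CASE WHEN EXISTS (SELECT 1 FROM EXAMP_TAB B
                          WHERE B.ID = A.ID
                            AND B.ID_VERSION = A.ID_VERSION - 1
                            AND B.VALUE_1 = A.VALUE_1)
            THEN 1
       END AS VALUE_2
  FROM EXAMP_TAB A;

Queries can then select VALUE_2 from EXAMP_TAB_V like any other column.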

Radagast81
  • Sorry, I am very unfamiliar with SQL, to be honest. How can I generalize your top selects? I have 28 000 000 rows and I cannot write 28 000 000 lines of code after the WITH command :) .. the result deviates from my expected result because you joined on a "+ 1" instead of a "- 1", I think. – Arne Nov 13 '19 at 09:03
  • Just skip the `WITH` lines; that is only for me and others here to get your sample data into our database to work with. – Radagast81 Nov 13 '19 at 09:05
  • I have changed that, but your merge will do the exact same (marking the wrong line), as you have the "-1" on the side of your table B and not on the side of table A. – Radagast81 Nov 13 '19 at 09:08
  • I'm confused because the result you posted is equal to my expected result. Furthermore, the query also takes quite a while, I am afraid :( – Arne Nov 13 '19 at 09:31
  • The confusion might be because I changed the query and result once you added your expected result. If the query is slow, you can add an index to your table containing at least `ID` and `ID_VERSION`, and perhaps even `VALUE_1`. – Radagast81 Nov 13 '19 at 09:43
  • I will try that. I was not keen on using indexes because I heard they slow things down when using UPDATE, but they speed up SELECTs. So I will just test it. – Arne Nov 13 '19 at 10:24
  • @Arne *beware of general rules* - an index can speed up an UPDATE and slow down a query as well ;). What you want to avoid is using an index to update *millions* of rows - here you need a `full scan` + `hash join` – Marmite Bomber Nov 13 '19 at 22:33

You can simply use an UPDATE statement for better performance, as follows:

Update EXAMP_TAB E
   SET VALUE_2 = 1 
 WHERE ROWID IN 
       (SELECT E1.ROWID 
          FROM EXAMP_TAB E1
          JOIN EXAMP_TAB E2
            ON E1.ID = E2.ID
           AND E1.ID_VERSION = E2.ID_VERSION - 1
           AND E1.VALUE_1 = E2.VALUE_1)

Here, an index on (ID, ID_VERSION, VALUE_1) will improve the performance of the overall update operation.

Cheers!!

Popeye

I would phrase this slightly differently:

MERGE INTO EXAMP_TAB USING
      (SELECT ID, ID_VERSION, VALUE_1
       FROM EXAMP_TAB
      ) TAB_B
      ON TAB_B.ID = EXAMP_TAB.ID AND
         TAB_B.ID_VERSION = EXAMP_TAB.ID_VERSION + 1 AND
         TAB_B.VALUE_1 = EXAMP_TAB.VALUE_1
WHEN MATCHED THEN
    UPDATE SET EXAMP_TAB.VALUE_2 = 1;

Then, an index CAN help. You want an index on EXAMP_TAB(ID, VALUE_1, ID_VERSION). This index is used for the "join" implicit in this code.
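
For reference, such an index could be created like this (the index name is only a placeholder):

CREATE INDEX examp_tab_join_idx ON EXAMP_TAB (ID, VALUE_1, ID_VERSION);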

That said, what you are learning is that updating 87 million rows is really inefficient. I would recommend that you just fetch the value using a query when you need it. You can use window functions:

select e.*,
       (case when lag(id_version) over (partition by id, value_1 order by id_version) = id_version - 1
             then 1 else 0
        end) as value_2
from examp_tab e;

With the above index, this should have quite good performance.

Gordon Linoff

As a first step you should calculate how many rows you will update. If the figure is low (say in the thousands), look at indexes. If you have to update millions of rows, index access is a NO GO.
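
A quick way to get that figure (just a sketch, reusing the self-join from the question with the direction implied by the expected result) is a simple count:

SELECT COUNT(DISTINCT a.ROWID) AS rows_to_update
  FROM EXAMP_TAB a
  JOIN EXAMP_TAB b
    ON a.ID = b.ID
   AND a.VALUE_1 = b.VALUE_1
   AND a.ID_VERSION = b.ID_VERSION + 1;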

Your MERGE statement is OK, but unfortunately it works only if the table has a primary key on ID, ID_VERSION, VALUE_1, which seems not to be the case based on the last two rows of your sample data.

Simply test the MERGE statement with this data:

        ID ID_VERSION VALUE_1 VALUE_2   
---------- ---------- ----    ----------
         1          1 A              
         1          2 A              
         1          2 A 

The MERGE fails with ORA-30926: unable to get a stable set of rows in the source tables, as it doesn't know which of the two rows should be updated.

If you want to update both (all) such rows use a two step approach with a temporary table.

Create a Temporary Table with All Rows to Be Updated

create table to_upd_tm as
select distinct a.ID, a.ID_VERSION, a.VALUE_1
from EXAMP_TAB  a
join EXAMP_TAB  b 
on a.id = b.id and
a.value_1 = b.value_1 and
a.id_version = b.id_version + 1 
;
create unique index to_upd_tm_idx1 on  to_upd_tm(ID, ID_VERSION, VALUE_1);

Note that the DISTINCT in the select removes the duplicates.

The unique index on the temporary table is required to make it a key-preserved table in the next step.

Use an Updatable Join View for a Performant Bulk Update

update (
select a.*
from EXAMP_TAB a
join to_upd_tm b
on a.id = b.id and
a.value_1 = b.value_1 and
a.id_version = b.id_version)
set value_2 = 1;

Result

        ID ID_VERSION VALU VALUE_2   
---------- ---------- ---- ----------
         1          1 A              
         1          2 A    1         
         1          2 A    1 

In general, for a bulk UPDATE you want to see an execution plan with a FULL TABLE SCAN on both tables using a HASH JOIN, as follows:

-----------------------------------------------------------------------------------------
| Id  | Operation           | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT    |           |   642K|    41M|       |  4251   (1)| 00:00:52 |
|   1 |  UPDATE             | EXAMP_TAB |       |       |       |            |          |
|*  2 |   HASH JOIN         |           |   642K|    41M|    25M|  4251   (1)| 00:00:52 |
|   3 |    TABLE ACCESS FULL| TO_UPD_TM |   642K|    18M|       |   329   (2)| 00:00:04 |
|   4 |    TABLE ACCESS FULL| EXAMP_TAB |   857K|    30M|       |   651   (1)| 00:00:08 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("A"."ID"="B"."ID" AND "A"."VALUE_1"="B"."VALUE_1" AND 
              "A"."ID_VERSION"="B"."ID_VERSION")
Marmite Bomber