
I am using this query to insert new entries into my table:

MERGE INTO CLEAN clean USING DUAL ON (clean.id = :id)
WHEN NOT MATCHED THEN INSERT (ID, COUNT) VALUES (:id, :xcount)
WHEN MATCHED THEN UPDATE SET clean.COUNT = clean.count + :xcount

It seems that I do more inserts than updates. Is there a way to improve my current performance?

I am using cx_Oracle with Python 3 and OracleDB 19c.
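
Roughly, this is how I call it from Python (a simplified sketch with placeholder connection details and data; in reality I pass executemany a list of dicts with all the rows):

import cx_Oracle

# placeholder connection details
connection = cx_Oracle.connect("user", "password", "host/service")
cursor = connection.cursor()

merge_sql = """
    MERGE INTO CLEAN clean USING DUAL ON (clean.id = :id)
    WHEN NOT MATCHED THEN INSERT (ID, COUNT) VALUES (:id, :xcount)
    WHEN MATCHED THEN UPDATE SET clean.COUNT = clean.count + :xcount"""

# one dict per row; the keys match the bind variable names
data = [{"id": 1, "xcount": 10}, {"id": 2, "xcount": 5}]

cursor.executemany(merge_sql, data)
connection.commit()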

john
  • Your problem is not caused by the `MERGE` statement, but by the fact that you are calling it **in a loop** - so called *row by row aka slow by slow*. You must change it to a call of *one* `MERGE` statement, where you replace the `dual` with a *temporary table* containing all the rows to be processed. – Marmite Bomber Apr 19 '21 at 12:57
  • @MarmiteBomber, could you explain more? I didn't quite understand – john Apr 19 '21 at 13:18
  • Well, there is a big difference if you call *one* `MERGE` to process 1M rows (= quick) or call a `MERGE` statement one *million* times to process *one row* each (= much slower, and your case). – Marmite Bomber Apr 19 '21 at 13:53
  • @MarmiteBomber Is there a way I can do it from python? I am just using execute_many – john Apr 19 '21 at 13:59
  • For sure there will be a way, post the whole statement, so the logic is visible... – Marmite Bomber Apr 19 '21 at 14:14
  • @MarmiteBomber, this is the whole statement. I just pass `execute_many` a dict with all the data – john Apr 19 '21 at 16:49
  • Yes, I understood. You are doing nothing wrong, but you should expect 20-30 secs to merge 1M rows. Is this something you are observing? Please check my answer for further possibilities. – Marmite Bomber Apr 19 '21 at 17:21

2 Answers


If you have massive problems with your approach, you are very probably missing an index on the column clean.id, which is required for your approach where the MERGE uses dual as a source for each row.

This is less probable since you say the id is the primary key.

So basically you are doing the right thing and you will see an execution plan similar to the one below:

---------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------
|   0 | MERGE STATEMENT                 |                 |       |       |     2 (100)|          |
|   1 |  MERGE                          | CLEAN           |       |       |            |          |
|   2 |   VIEW                          |                 |       |       |            |          |
|   3 |    NESTED LOOPS OUTER           |                 |     1 |    40 |     2   (0)| 00:00:01 |
|   4 |     TABLE ACCESS FULL           | DUAL            |     1 |     2 |     2   (0)| 00:00:01 |
|   5 |     VIEW                        | VW_LAT_A18161FF |     1 |    38 |     0   (0)|          |
|   6 |      TABLE ACCESS BY INDEX ROWID| CLEAN           |     1 |    38 |     0   (0)|          |
|*  7 |       INDEX UNIQUE SCAN         | CLEAN_UX1       |     1 |       |     0   (0)|          |
---------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   7 - access("CLEAN"."ID"=:ID)

So the execution plan is fine and works effectively, but it has one problem.

Remember: whenever you use an index, you will be happy while processing a few rows, but it will not scale.

If you are processing millions of records, you may fall back to two-step processing:

  • insert all rows in a temporary table

  • perform a single MERGE statement using the temporary table

The big advantage is that Oracle can use a hash join and get rid of the index access for each of the million rows.

Here is an example of a test where the clean table is initialized with 1M ids (not shown) and we perform 1M inserts and 1M updates:

n = 1000000
# 2M staged rows: ids 0..n-1 already exist in clean (updates), ids n..2n-1 are new (inserts)
data2 = [{"id": i, "xcount": 1} for i in range(2*n)]

sql3 = """
    insert into tmp (id, count)
    values (:id, :xcount)"""
sql4 = """MERGE INTO clean USING tmp ON (clean.id = tmp.id)
          WHEN NOT MATCHED THEN INSERT (id, count) VALUES (tmp.id, tmp.count)
          WHEN MATCHED THEN UPDATE SET clean.count = clean.count + tmp.count"""

# bulk-load the staging table, then run one set-based MERGE
cursor.executemany(sql3, data2)
cursor.execute(sql4)
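
Note that the snippet above assumes cursor comes from an open cx_Oracle connection and that the staging table tmp already exists. One possible definition (an assumption, not part of the measured test) is a global temporary table, so the staged rows are private to the session and disappear when it ends:

# hypothetical DDL for the staging table; a plain heap table would work as well
cursor.execute("""
    CREATE GLOBAL TEMPORARY TABLE tmp (
        id    NUMBER,
        count NUMBER
    ) ON COMMIT PRESERVE ROWS""")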

The test runs in approx. 10 seconds, which is less than half the time of your approach with MERGE using dual.

If this is still not enough, you'll have to use the parallel option.
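
A rough sketch of what that could look like (the degree of 4 is just an example; parallel DML has to be enabled in the session for the MERGE to run in parallel):

# enable parallel DML for this session, then hint the set-based MERGE
cursor.execute("ALTER SESSION ENABLE PARALLEL DML")
cursor.execute("""
    MERGE /*+ PARALLEL(clean 4) */ INTO clean
    USING tmp ON (clean.id = tmp.id)
    WHEN NOT MATCHED THEN INSERT (id, count) VALUES (tmp.id, tmp.count)
    WHEN MATCHED THEN UPDATE SET clean.count = clean.count + tmp.count""")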

Marmite Bomber

MERGE is quite fast. Inserts are usually faster than updates, I'd say.

So, if you're asking how to make inserts faster, then it depends.

  • If you're inserting one row at a time, there shouldn't be any bottleneck.
  • If you're inserting millions of rows, see whether there are triggers enabled on the table which fire for each row and do something (slowing the process down).

As for updates, is there an index on the clean.id column? If not, one would probably help.

Otherwise, see what the explain plan says, and collect statistics regularly.
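
For the statistics part, a minimal sketch of what that could look like from the same cursor (assuming the table lives in your own schema):

# refresh optimizer statistics on the CLEAN table
cursor.execute("""
    BEGIN
        DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'CLEAN');
    END;""")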

Littlefoot
  • `clean.id` is the primary key and there aren't any triggers. How can I get the explain plan from Python? – john Apr 19 '21 at 13:18
  • I wouldn't know; it is much closer to Oracle than Python and I get it from e.g. SQL Developer or TOAD (or even SQL*Plus) - in other words, tools I use to access my Oracle database. As for ID being the primary key, it also means that there's a unique index on that column as well, so - yes, it is indexed. – Littlefoot Apr 19 '21 at 17:54
  • BTW, @Marmite Bomber nicely explained it in their answer. – Littlefoot Apr 19 '21 at 17:56