I'm using SQLAlchemy 1.0.0 and want to run some UPDATE-only queries in batch (update a row if its primary key matches, otherwise do nothing).

I've run some experiments and found that a bulk update looks much slower than a bulk insert or a bulk upsert.

Could you please help me figure out why it is so slow, or is there any alternative way/idea to do a BULK UPDATE (not BULK UPSERT) with SQLAlchemy?

Below is the table in MySQL:

CREATE TABLE `test` (
  `id` int(11) unsigned NOT NULL,
  `value` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

And the test code:

from sqlalchemy import create_engine, text
import time

driver = 'mysql'
host = 'host'
user = 'user'
password = 'password'
database = 'database'
url = "{}://{}:{}@{}/{}?charset=utf8".format(driver, user, password, host, database)

engine = create_engine(url)

engine.execute('TRUNCATE TABLE test')

num_of_rows = 1000

rows = []
for i in xrange(0, num_of_rows):
    rows.append({'id': i, 'value': i})

print '--------- test insert --------------'
sql = '''
    INSERT INTO test (id, value)
    VALUES (:id, :value)
'''
start = time.time()
# a list of parameter dicts makes SQLAlchemy call the DBAPI's executemany()
engine.execute(text(sql), rows)
end = time.time()
print 'Cost {} seconds'.format(end - start)

print '--------- test upsert --------------'
for r in rows:
    r['value'] = r['id'] + 1

sql = '''
    INSERT INTO test (id, value)
    VALUES (:id, :value)
    ON DUPLICATE KEY UPDATE value = VALUES(value)
'''
start = time.time()
engine.execute(text(sql), rows)
end = time.time()
print 'Cost {} seconds'.format(end - start)

print '--------- test update --------------'
for r in rows:
    r['value'] = r['id'] * 10

sql = '''
    UPDATE test
    SET value = :value
    WHERE id = :id
'''
start = time.time()
engine.execute(text(sql), rows)
end = time.time()
print 'Cost {} seconds'.format(end - start)

The output when num_of_rows = 100:

--------- test insert --------------
Cost 0.568960905075 seconds
--------- test upsert --------------
Cost 0.569655895233 seconds
--------- test update --------------
Cost 20.0891299248 seconds

The output when num_of_rows = 1000:

--------- test insert --------------
Cost 0.807548999786 seconds
--------- test upsert --------------
Cost 0.584554195404 seconds
--------- test update --------------
Cost 206.199367046 seconds

The network latency to database server is around 500ms.

It looks like the bulk update sends and executes each query one by one rather than in a single batch?

Thanks in advance.

twds
  • There is no such thing as bulk update. Try to encapsulate all updates in a single transaction and see how that performs. – Shadow Oct 27 '15 at 07:30
  • @Shadow it is still very slow (costs nearly the same time). It looks like it just sends each update query to the database one by one, while the network latency of each round trip is 500ms. – twds Oct 27 '15 at 07:59
  • Then generate the SQL string yourself, with all update statements separated by ;, and send it as a single batch to your db (a sketch of this appears after these comments). – Shadow Oct 27 '15 at 08:28
  • @Shadow thanks, it works much faster now. But is it the same way for SQLAlchemy to implement bulk insert / bulk upsert? Previously I thought it send the SQL template to database, and then send the batch of data to server, so I thought it should also work for bulk update. – twds Oct 27 '15 at 09:03
  • I repeat: there is no such thing as bulk update. There is a bulk insert, but as @pi pointed out in his response, sqlalchemy does not even use that syntax. I'm not entirely sure if a mass insert and then an update using joins would be much faster than sending the update statements in a batch using a single transaction. – Shadow Oct 27 '15 at 16:50
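
A minimal sketch of that single-batch idea, assuming the MySQLdb driver (the default for a mysql:// URL) and its CLIENT.MULTI_STATEMENTS connection flag; the values are interpolated directly only because id and value are known integers here:

from MySQLdb.constants import CLIENT
from sqlalchemy import create_engine

# allow several ;-separated statements in a single round trip
engine = create_engine(url, connect_args={'client_flag': CLIENT.MULTI_STATEMENTS})

# one big string of UPDATEs; fine here because id/value are plain integers,
# real code must escape or parameterize any untrusted values
sql = ';\n'.join(
    'UPDATE test SET value = {} WHERE id = {}'.format(r['value'], r['id'])
    for r in rows
)

raw = engine.raw_connection()
try:
    cur = raw.cursor()
    cur.execute(sql)                      # one round trip for all updates
    while cur.nextset() is not None:      # drain each statement's result
        pass
    raw.commit()
finally:
    raw.close()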

1 Answer


You can speed up bulk update operations with a trick, even if the database server (as in your case) has very bad latency. Instead of updating your table directly, you insert your new data very fast into a stage table, then do one join-update against the destination table. This also has the advantage of dramatically reducing the number of statements you have to send to the database.

How does this work with UPDATEs?

Say you have a table entries with new data coming in all the time, but you only want to update rows that have already been stored. You create a copy of your destination table, entries_stage, with only the relevant fields in it:

from sqlalchemy import MetaData, Table, Column, Integer, Unicode

metadata = MetaData()

entries = Table('entries', metadata,
    Column('id', Integer, autoincrement=True, primary_key=True),
    Column('value', Unicode(64), nullable=False),
)

entries_stage = Table('entries_stage', metadata,
    Column('id', Integer, autoincrement=False, unique=True),
    Column('value', Unicode(64), nullable=False),
)

Then you insert your data with a bulk insert. This can be sped up even further with MySQL's multi-row insert syntax, which SQLAlchemy's executemany doesn't emit by itself, but which can be built without much difficulty, as sketched below.

INSERT INTO entries_stage (`id`, `value`)
VALUES
(1, 'string1'), (2, 'string2'), (3, 'string3'), ...;
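
A rough sketch of building that statement from Python (my own illustration, not part of the original answer), using DBAPI %s placeholders so the driver still escapes every value:

# build "(%s, %s), (%s, %s), ..." with one pair per row
placeholders = ', '.join(['(%s, %s)'] * len(rows))
sql = 'INSERT INTO entries_stage (id, value) VALUES ' + placeholders

params = []
for r in rows:
    params.extend([r['id'], r['value']])

raw = engine.raw_connection()
try:
    raw.cursor().execute(sql, params)     # single multi-row INSERT
    raw.commit()
finally:
    raw.close()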

In the end, you update the values of the destination table with the values from the stage table like this:

UPDATE entries e
JOIN entries_stage es ON e.id = es.id
SET e.value = es.value
WHERE e.value != es.value;

Then you're done.
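
Put together, the whole round trip could look roughly like this (a sketch assuming the engine and rows from the question; DELETE is used instead of TRUNCATE so the statement stays inside the transaction):

from sqlalchemy import text

with engine.begin() as conn:   # one transaction for the whole batch
    # clear the stage table so it only holds the current batch
    conn.execute(text('DELETE FROM entries_stage'))

    # bulk insert the incoming batch
    conn.execute(
        text('INSERT INTO entries_stage (id, value) VALUES (:id, :value)'),
        rows,
    )

    # one statement updates every destination row that actually changed
    conn.execute(text('''
        UPDATE entries e
        JOIN entries_stage es ON e.id = es.id
        SET e.value = es.value
        WHERE e.value != es.value
    '''))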

What about inserts?

This also works to speed up inserts, of course. As you already have the data in the stage table, all you need to do is issue an INSERT INTO ... SELECT statement for the data that is not in the destination table yet.

INSERT INTO entries (id, value)
SELECT es.id, es.value
FROM entries_stage es
LEFT JOIN entries e ON e.id = es.id
WHERE e.id IS NULL;

The nice thing about this is that you don't have to resort to INSERT IGNORE, REPLACE, or ON DUPLICATE KEY UPDATE, which burn through your auto-increment values even when they end up doing nothing.

pi.