cursor.execute
vs cursor.executemany
minimal synthetic benchmark
main.py
from pathlib import Path
import sqlite3
import csv
f = 'tmp.sqlite'
n = 10000000
Path(f).unlink(missing_ok=True)
connection = sqlite3.connect(f)
#connection.set_trace_callback(print)
cursor = connection.cursor()
cursor.execute("CREATE TABLE t (x integer)")
#for i in range(n):
# cursor.execute(f"INSERT INTO t VALUES ({i})")
cursor.executemany(f"INSERT INTO t VALUES (?)", ((str(i),) for i in range(n)))
connection.commit()
connection.close()
Results according to time main.py
:
method |
storage |
time (s) |
executemany |
SSD |
8.6 |
executemany |
:memory: |
9.8 |
executemany |
HDD |
10.4 |
execute |
SSD |
31 |
execute |
:memory: |
29 |
Baseline time of for i in range(n): pass
: 0.3s.
Conclusions:
executemany
is about 3x faster
- we are not I/O bound on
execute
vs executemany
, we are likely memory bound. This might not be very surprising given that the final tmp.sqlite
file is only 115 MB, and that my SSD can handle 3 GB/s
If I enable connection.set_trace_callback(print)
to log queries as per: How can I log queries in Sqlite3 with Python? and reduce n = 5
I see the exact same queries for both execute
and executemany
:
CREATE TABLE t (x integer)
BEGIN
INSERT INTO t VALUES ('0')
INSERT INTO t VALUES ('1')
INSERT INTO t VALUES ('2')
INSERT INTO t VALUES ('3')
INSERT INTO t VALUES ('4')
COMMIT
so it does not seem that the speed difference is linked to transactions, as both appear to execute within a single transaction. There are some comments on automatic transaction control at: https://docs.python.org/3/library/sqlite3.html#transaction-control but they are not super clear, I hope the logs are correct.
Insertion time baseline
I'll be looking out for the fastest possible method I can find to compare it to Python. So far, this generate_series
approach is the winner:
f="10m.sqlite"
rm -f "$f"
sqlite3 "$f" 'create table t(x integer)'
time sqlite3 "$f" 'insert into t select value as x from generate_series(1,10000000)'
finished as fast as 1.4s on SSD, therefore substantially faster than any Python method so far, and well beyond being SSD-bound.
Tested on Ubuntu 23.04, Python 3.11.2, Lenovo ThinkPad P51
- SSD: Samsung MZVLB512HAJQ-000L7 512GB SSD, 3 GB/s nominal speed
- HDD: Seagate ST1000LM035-1RK1 1TB, 140 MB/s nominal speed