I'm going to stick with Python, since your script does some extra work rather than just reading the CSV directly into the database, like setting the creation time and populating multiple fields with "not_available".
The short answer is to batch your writes and use sqlite3.Cursor.executemany, because transactions are expensive and your current code pays that cost on every single insert.
That part has been answered before - but I'll show you how much faster batched writes are compared to a transaction per row: in the tests below, batching gave me a speedup of roughly 200x.
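The core change looks like this (a minimal sketch against a throwaway table, not your schema): build a list of row tuples, hand the whole list to executemany, and commit once per batch instead of once per row.
#!/usr/bin/env python3
# Minimal sketch of the idea (placeholder table/columns, not the real schema):
# one executemany() call plus one commit per batch, instead of a commit
# after every single INSERT.
import sqlite3

db = sqlite3.connect("example.db")
cur = db.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS t (a TEXT, b TEXT)")

rows = [(str(i), str(i * 2)) for i in range(10_000)]  # pretend this is one batch of CSV rows
cur.executemany("INSERT INTO t VALUES (?, ?)", rows)
db.commit()  # one transaction for the whole batch
cur.close()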
Test Setup
All tests were run on a MacBook Pro (M1 Max, 32 GB RAM, 1 TB SSD).
For these tests, I generated two-column CSV files of 1 million lines using this command:
hexdump -v -e '4/1 "%02x""\n"' /dev/urandom | \
paste -d "," - - | \
head -n 1000000 > test.csv
And the files look like this:
> head -n 10 test.csv
d224d9ab,76e8eb72
358a23af,6fa66351
a24c330b,d0a15fea
e587bd05,c812d51b
71e23afc,5d4654a7
fac87ef8,0576ee3f
baf53d50,a82bfd43
193cbf62,553cd6bb
ba1f82b5,2200ad0d
2387b69f,6100ae3b
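If you don't have hexdump and paste handy, a rough Python equivalent of that pipeline (my own sketch, not part of the original setup) is:
# Hypothetical Python equivalent of the hexdump/paste pipeline above:
# two comma-separated columns of 8 random hex characters per line.
import os

with open("test.csv", "w") as f:
    for _ in range(1_000_000):
        f.write(f"{os.urandom(4).hex()},{os.urandom(4).hex()}\n")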
Perf Test - executemany with different batch sizes
#!/usr/bin/env python3
# execute_many.py
import csv
import datetime
import itertools
import sqlite3
import time

# This script creates and populates four databases from scratch
# using 4 different CSV files

# modified from grouper() recipe at:
# https://docs.python.org/3/library/itertools.html#itertools-recipes
def file_chunker(fp, batch_size):
    args = [iter(fp)] * batch_size
    return itertools.zip_longest(*args, fillvalue=None)

if __name__ == "__main__":
    # here I created test files called test-1000.csv,
    # test-10000.csv, and so on
    batch_sizes = [1000, 10000, 100000, 1000000]
    for batch_size in batch_sizes:
        db = sqlite3.connect(f"test-{batch_size}.db")
        cur = db.cursor()
        cur.execute("""CREATE TABLE hostname_table (
            name TEXT,
            hostname TEXT,
            scheme TEXT,
            description TEXT,
            created_date INTEGER
        )""")
        db.commit()

        with open(f"test-{batch_size}.csv", "r") as f:
            reader = csv.reader(f)
            chunker = file_chunker(reader, batch_size)
            start = time.perf_counter()
            for batch in chunker:
                data = [
                    (row[0], row[1], "not available", "not available", datetime.datetime.now())
                    for row in batch
                    if row is not None
                ]
                try:
                    cur.executemany("INSERT INTO hostname_table VALUES (?, ?, ?, ?, ?)", data)
                    db.commit()
                except Exception:
                    db.rollback()
            end = time.perf_counter()

        print(f"Elapsed time (batch size: {batch_size}): {end - start}")
        cur.close()
Running this, I get:
❯ python execute_many.py
Elapsed time (batch size: 1000): 1.9760890419711359
Elapsed time (batch size: 10000): 1.7937561669969
Elapsed time (batch size: 100000): 2.017556666978635
Elapsed time (batch size: 1000000): 2.7237499579787254
Reading these files in chunks of 1K/10K/100K/1M lines at a time shows that the larger batch sizes don't buy you any additional savings, but they're all very fast to run - basically 2 seconds to populate a million-row database.
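As an aside (not part of the original test): if you're on Python 3.12+, itertools.batched can replace the zip_longest-based file_chunker, and since it never pads the final batch with None, the row is not None filter goes away. A rough sketch, reusing the same file and table layout:
# Sketch assuming Python 3.12+ (itertools.batched); mirrors the script above
# but lets the standard library do the chunking.
import csv
import datetime
import itertools
import sqlite3

db = sqlite3.connect("test-batched.db")
cur = db.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS hostname_table (
    name TEXT,
    hostname TEXT,
    scheme TEXT,
    description TEXT,
    created_date INTEGER
)""")

with open("test.csv", "r") as f:
    for batch in itertools.batched(csv.reader(f), 10000):
        data = [
            (row[0], row[1], "not available", "not available", datetime.datetime.now())
            for row in batch
        ]
        cur.executemany("INSERT INTO hostname_table VALUES (?, ?, ?, ?, ?)", data)
        db.commit()  # one commit per batch, as in the script above
cur.close()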
Perf Test - execute with transaction-per-row
Here's a reduced version of your code that only deals with the sqlite3 portion; it creates a new database and populates it from my 1M-line file, using a transaction per insert like your current code does.
#!/usr/bin/env python3
# process_data.py
import csv
import datetime
import sqlite3
import time

if __name__ == "__main__":
    db = sqlite3.connect("test.db")
    cur = db.cursor()
    cur.execute("""CREATE TABLE hostname_table (
        name TEXT,
        hostname TEXT,
        scheme TEXT,
        description TEXT,
        created_date INTEGER
    )""")
    db.commit()

    with open("test.csv", "r") as f:
        reader = csv.reader(f)
        start = time.perf_counter()
        for row in reader:
            db_row = (row[0], row[1], "not available", "not available", datetime.datetime.now())
            try:
                cur.execute("INSERT INTO hostname_table VALUES (?, ?, ?, ?, ?)", db_row)
                db.commit()
            except Exception:
                db.rollback()
        end = time.perf_counter()

    print(f"Elapsed time: {end - start}")
    cur.close()
Running this, here's the elapsed time I get...
> time python process_data.py
Elapsed time: 390.3731578750303
python process_data.py 10.82s user 173.25s system 47% cpu 6:30.41 total
So about 6.5 minutes on my test machine, which shows just how much slower a transaction per row is.
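To be clear about where that time goes (this variant is mine, not from your script): even if you keep the per-row execute() calls, committing once at the end instead of once per row removes most of the overhead, because it's the per-commit transaction cost that dominates, not execute() itself.
#!/usr/bin/env python3
# single_commit.py - my own variant (not from the original post):
# same per-row execute() calls, but one commit for the whole file.
import csv
import datetime
import sqlite3

db = sqlite3.connect("test-single-commit.db")
cur = db.cursor()
cur.execute("""CREATE TABLE hostname_table (
    name TEXT,
    hostname TEXT,
    scheme TEXT,
    description TEXT,
    created_date INTEGER
)""")

with open("test.csv", "r") as f:
    for row in csv.reader(f):
        cur.execute(
            "INSERT INTO hostname_table VALUES (?, ?, ?, ?, ?)",
            (row[0], row[1], "not available", "not available", datetime.datetime.now()),
        )
db.commit()  # one transaction for all rows instead of one per row
cur.close()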
Perf Test - Insertion Time
For fun, let's look at how much time it takes to insert a single row vs. a batch insert of 1000 rows.
Here's a script that creates two databases: one is written to using execute, while the other is written to using executemany with a batch of 1000.
#!/usr/bin/env python3
import datetime
import sqlite3
import random
import string
import timeit

def insert_single(db, cursor, data):
    cursor.execute("INSERT INTO hostname_table VALUES (?, ?, ?, ?, ?)", data)
    db.commit()

def insert_many(db, cursor, data):
    cursor.executemany("INSERT INTO hostname_table VALUES (?, ?, ?, ?, ?)", data)
    db.commit()

def row_generator(count=1):
    return [
        (
            ''.join(random.choices(string.ascii_letters, k=10)),
            ''.join(random.choices(string.ascii_letters, k=10)),
            'not_available',
            'not_available',
            datetime.datetime.now()
        )
        for _ in range(count)
    ]

if __name__ == "__main__":
    db_single = sqlite3.connect("timeit_test_single.db")
    db_multi = sqlite3.connect("timeit_test_multi.db")

    cur_single = db_single.cursor()
    cur_single.execute("""CREATE TABLE hostname_table (
        name TEXT,
        hostname TEXT,
        scheme TEXT,
        description TEXT,
        created_date INTEGER
    )""")
    db_single.commit()

    cur_multi = db_multi.cursor()
    cur_multi.execute("""CREATE TABLE hostname_table (
        name TEXT,
        hostname TEXT,
        scheme TEXT,
        description TEXT,
        created_date INTEGER
    )""")
    db_multi.commit()

    single_data = row_generator()[0]
    multi_data = row_generator(1000)

    print(timeit.timeit(lambda: insert_single(db_single, cur_single, single_data), number=10000))
    print(timeit.timeit(lambda: insert_many(db_multi, cur_multi, multi_data), number=10000))
I run the timeit test in ipython, and here's what I get...
In [16]: %timeit insert_single(db_single, cur_single, single_data)
271 µs ± 71.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [17]: %timeit insert_many(db_multi, cur_multi, multi_data)
1.44 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
So you can see that a single execute statement with a commit takes roughly 200-300 µs (0.2 - 0.3 ms), whereas writing a batch of 1000 with executemany takes roughly 5 times longer at 1.4+ ms. Only 5x longer to write 1000x more rows - that works out to about 1.4 µs per row batched vs. roughly 270 µs per row one at a time.