
I am trying to load a CSV of 2 million rows into a SQLite database, but it is taking forever for the data to load.

I calculated it might take up to 3 days for the data to be uploaded. I might have to do this often. Is there a better way to read a CSV and add it to an empty table?

Below is the code I am trying -

import sqlite3
import csv
import pandas as pd
from datetime import datetime

cnx = sqlite3.connect('db/homepages.db')
cursor = cnx.cursor()
df = pd.read_sql_query("SELECT * FROM hostname_table", cnx)
with open('lb_csv.csv', 'r') as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        if i % 10000 == 0:
            print(i)
        name_input = row[0]
        hostname_input = row[1]
        scheme_input = 'not_available'
        description_input = 'not_available'
        created_date_input = datetime.now()
        try:
            cursor.execute("INSERT INTO hostname_table (name, hostname, scheme, description, created_date) VALUES (?, ?, ?, ?, ?)", (name_input, hostname_input, scheme_input, description_input, created_date_input))
            cnx.commit()
        except Exception as e:
            print(e)
            cnx.rollback()
ledzee
  • Have you tried `executemany()`? https://stackoverflow.com/a/5331961/10035985 – Andrej Kesely Aug 10 '22 at 18:20
    You're making a transaction for every single insert, which is very expensive. Try `executemany` with multiple rows rather than a transaction-per-row. – wkl Aug 10 '22 at 18:20
  • Why are you reading the table into a dataframe before starting your insert? Why are you using sqlite3 when this so desperately cries out for a real database? sqlite3 was designed for simplicity and convenience, not for performance. – Tim Roberts Aug 10 '22 at 18:21
  • Is using PYTHON mandatory to solve your problem? – Pablo Santa Cruz Aug 10 '22 at 18:21
  • Using Python is not mandatory. – ledzee Aug 10 '22 at 18:24
  • Why is Pandas used here? Python can handle this, just not in the way you're trying. What are you trying to do anyway? You can open a CSV file with SQLite in CSV mode and query it using SQL. You don't need to copy the data into another file first. – Panagiotis Kanavos Aug 10 '22 at 18:43
  • @AndrejKesely & wkl you can add executemany() as answer, I will mark as the best answer. Using executemany this was completed in 10 minutes. Thank you. – ledzee Aug 10 '22 at 19:44

3 Answers


If performance is an issue and you can drop the requirement to use Python, you could just do:

sqlite> .mode csv
sqlite> .import /tmp/hostname_data.csv hostname_table
Pablo Santa Cruz

If you don't need to do it in Python, the sqlite3 command line shell has a .import command that will directly import CSV files using the --csv flag. Here's an example from the documentation at https://www.sqlite.org/cli.html#csv_import

sqlite> .import --csv --skip 1 --schema temp C:/work/somedata.csv tab1

The --skip option tells it how many header rows to skip, and the --schema option specifies a schema other than the current one to import into. The last argument is the name of the table to import into.
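
If you want to stay in Python but still take advantage of the CLI importer, one option is to shell out to it. Here's a minimal sketch, assuming the sqlite3 binary is on your PATH and is recent enough to support the --csv and --skip flags, and using the database path, CSV name and table name from the question; note that, unlike your script, this won't fill in the extra columns such as created_date:

import subprocess

# Let the sqlite3 CLI do the bulk load. The paths and table name are
# assumptions taken from the question; drop "--skip 1" if your CSV has
# no header row.
subprocess.run(
    ["sqlite3", "db/homepages.db",
     ".import --csv --skip 1 lb_csv.csv hostname_table"],
    check=True,
)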

Rob Heiser

I'm going to stick with Python, since you do some extra stuff in your script beyond directly reading the CSV into the database, like setting the creation time and populating multiple fields with "not_available".

The short answer is to batch your writes and use sqlite3.Cursor.executemany, because the cost of a transaction is quite expensive when you're making a transaction for every write.
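
Applied to your loop, that might look something like the sketch below (based on the code in your question; the batch size of 10,000 is just a starting point to tune):

import csv
import sqlite3
from datetime import datetime

cnx = sqlite3.connect('db/homepages.db')
cursor = cnx.cursor()

insert_sql = ("INSERT INTO hostname_table "
              "(name, hostname, scheme, description, created_date) "
              "VALUES (?, ?, ?, ?, ?)")

with open('lb_csv.csv', 'r') as f:
    reader = csv.reader(f)
    batch = []
    for row in reader:
        batch.append((row[0], row[1], 'not_available', 'not_available', datetime.now()))
        if len(batch) >= 10000:
            cursor.executemany(insert_sql, batch)
            cnx.commit()   # one commit per 10,000 rows instead of one per row
            batch = []
    if batch:              # flush whatever is left over
        cursor.executemany(insert_sql, batch)
        cnx.commit()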

But that's been answered before - what I'll show here is how much faster batched writes are compared to doing a transaction per row, as I got speedups of 300x+ by batching.


Test Setup

All tests were run on a MacBook Pro (M1 Max, 32 GB RAM, 1 TB SSD).

For these tests, I generate two-column CSV files of 1 million lines using this command:

hexdump -v -e '4/1 "%02x""\n"' /dev/urandom | \
paste -d "," - - | \
head -n 1000000 > test.csv

And the files look like this:

> head -n 10 test.csv
d224d9ab,76e8eb72
358a23af,6fa66351
a24c330b,d0a15fea
e587bd05,c812d51b
71e23afc,5d4654a7
fac87ef8,0576ee3f
baf53d50,a82bfd43
193cbf62,553cd6bb
ba1f82b5,2200ad0d
2387b69f,6100ae3b
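
(If you don't have hexdump handy, roughly equivalent test data can be generated in Python; a quick sketch:)

import random

# Roughly equivalent to the hexdump pipeline above: 1,000,000 lines,
# each with two random 8-hex-digit columns.
with open("test.csv", "w") as f:
    for _ in range(1_000_000):
        f.write(f"{random.getrandbits(32):08x},{random.getrandbits(32):08x}\n")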

Perf Test - executemany with different batch sizes

#!/usr/bin/env python3

# execute_many.py

import csv
import datetime
import itertools
import sqlite3
import time

# This script creates and populates four databases from scratch
# using 4 different CSV files

# modified from grouper() recipe at:
# https://docs.python.org/3/library/itertools.html#itertools-recipes
def file_chunker(fp, batch_size):
    args = [iter(fp)] * batch_size
    return itertools.zip_longest(*args, fillvalue=None)

if __name__ == "__main__":
    # here I created test files called test-1000.csv,
    # test-10000.csv, and so on
    batch_sizes = [1000, 10000, 100000, 1000000]

    for batch_size in batch_sizes:
        db = sqlite3.connect(f"test-{batch_size}.db")
        cur = db.cursor()

        cur.execute("""CREATE TABLE hostname_table (
            name TEXT,
            hostname TEXT,
            scheme TEXT,
            description TEXT,
            created_date INTEGER
        )""")
        db.commit()
        with open(f"test-{batch_size}.csv", "r") as f:
            reader = csv.reader(f)
            chunker = file_chunker(reader, batch_size)
            start = time.perf_counter()
            for batch in chunker:
                data = [
                    (row[0], row[1], "not available", "not available", datetime.datetime.now())
                    for row in batch if row is not None
                ]
                try:
                    cur.executemany("INSERT INTO hostname_table VALUES (?, ?, ?, ?, ?)", data)
                    db.commit()
                except Exception:
                    db.rollback()
            end = time.perf_counter()
            print(f"Elapsed time (batch size: {batch_size}): {end - start}")

        cur.close()

Running this, I get:

❯ python execute_many.py
Elapsed time (batch size: 1000): 1.9760890419711359
Elapsed time (batch size: 10000): 1.7937561669969
Elapsed time (batch size: 100000): 2.017556666978635
Elapsed time (batch size: 1000000): 2.7237499579787254

By reading these files in chunks of 1K/10K/100K/1M lines at a time, you can see that the increasingly large batches don't really buy you any additional savings in time, but they're all very fast to run: basically 2 seconds to populate a million-row database.


Perf Test - execute with transaction-per-row

Here's a reduced version of your code that only deals with the sqlite3 portion; it creates a new database and populates it with my 1M-line file, using a transaction per insert like your current code does.

#!/usr/bin/env python3

# process_data.py

import csv
import datetime
import sqlite3
import time

if __name__ == "__main__":
    db = sqlite3.connect("test.db")
    cur = db.cursor()

    cur.execute("""CREATE TABLE hostname_table (
        name TEXT,
        hostname TEXT,
        scheme TEXT,
        description TEXT,
        created_date INTEGER
    )""")
    db.commit()
    with open("test.csv", "r") as f:
        reader = csv.reader(f)
        start = time.perf_counter()
        for row in reader:
            db_row = (row[0], row[1], "not available", "not available", datetime.datetime.now())
            try:
                cur.execute("INSERT INTO hostname_table VALUES (?, ?, ?, ?, ?)", db_row)
                db.commit()
            except Exception:
                db.rollback()
        end = time.perf_counter()
        print(f"Elapsed time: {end - start}")

    cur.close()

Running this, here's the elapsed time I get...

> time python process_data.py
Elapsed time: 390.3731578750303
python process_data.py  10.82s user 173.25s system 47% cpu 6:30.41 total

So 6.5 minutes on my test machine - which shows you how much slower doing a transaction per row is.
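
The expensive part is the per-row commit rather than execute itself - as the comments point out, every insert above is its own transaction. A variant I haven't benchmarked here, but which relies on the same principle as batching, keeps the row-at-a-time execute calls and wraps the whole load in a single transaction:

import csv
import datetime
import sqlite3

db = sqlite3.connect("test.db")
cur = db.cursor()

# Assumes hostname_table already exists (e.g. created as in the script above).
with open("test.csv", "r") as f:
    reader = csv.reader(f)
    try:
        for row in reader:
            cur.execute(
                "INSERT INTO hostname_table VALUES (?, ?, ?, ?, ?)",
                (row[0], row[1], "not available", "not available", datetime.datetime.now()),
            )
        db.commit()   # a single commit for the whole file
    except Exception:
        db.rollback()

cur.close()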


Perf Test - Insertion Time

For fun, let's look at how much time it takes to insert a single row vs. a batch insert of 1000.

Here's a script that creates two databases, and one is written to using execute while the other is written to using executemany with a batch of 1000.

#!/usr/bin/env python3

import datetime
import sqlite3
import random
import string
import timeit

def insert_single(db, cursor, data):
    cursor.execute("INSERT INTO hostname_table VALUES (?, ?, ?, ?, ?)", data)
    db.commit() 

def insert_many(db, cursor, data):
    cursor.executemany("INSERT INTO hostname_table VALUES (?, ?, ?, ?, ?)", data)
    db.commit() 

def row_generator(count=1):
    return [
        (''.join(random.choices(string.ascii_letters, k=10)),
         ''.join(random.choices(string.ascii_letters, k=10)),
         'not_available',
         'not_available',
         datetime.datetime.now()
        )
        for _ in range(count)
    ]


if __name__ == "__main__":
    db_single = sqlite3.connect("timeit_test_single.db")
    db_multi = sqlite3.connect("timeit_test_multi.db")

    cur_single = db_single.cursor()
    cur_single.execute("""CREATE TABLE hostname_table (
        name TEXT,
        hostname TEXT,
        scheme TEXT,
        description TEXT,
        created_date INTEGER
    )""")
    db_single.commit()

    cur_multi = db_multi.cursor()
    cur_multi.execute("""CREATE TABLE hostname_table (
        name TEXT,
        hostname TEXT,
        scheme TEXT,
        description TEXT,
        created_date INTEGER
    )""")
    db_multi.commit()


    single_data = row_generator()[0]
    multi_data = row_generator(1000)

    print(timeit.timeit(lambda: insert_single(db_single, cur_single, single_data), number=10000))
    print(timeit.timeit(lambda: insert_many(db_multi, cur_multi, multi_data), number=10000))

I ran the timing in IPython with %timeit, and here's what I get...

In [16]: %timeit insert_single(db_single, cur_single, single_data)
271 µs ± 71.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [17]: %timeit insert_many(db_multi, cur_multi, multi_data)
1.44 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

So you can see that a single execute statement with a commit takes roughly 200-300 µs (0.2-0.3 ms), whereas writing a batch of 1000 with executemany takes roughly 5 times longer at 1.4+ ms. Only 5x longer to write 1000x more rows!

wkl