
I have a large dataset I want to load into a SQLite in-memory database. I plan on loading the data from a file exported from Postgres. Which file format and load mechanism are the fastest?

Currently I'm considering the following two options:

1. Exporting the table to a CSV file and inserting the rows into SQLite from there.
2. Exporting INSERT statements with pg_dump and executing them against SQLite.

Which is faster? Is there a third, faster option, maybe?

This will be done as part of a Python 3 script. Does that affect the choice?
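For reference, this is roughly what I have in mind for each option (an untested sketch; the table layout and file names are placeholders, and in practice I'd only use one of the two):

import csv
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE my_table (id INTEGER, my_field TEXT);")

# Option 1: read a CSV export and insert the rows with executemany
with open("export.csv", "r", encoding="utf8") as f:
    rows = csv.reader(f)
    next(rows)  # skip the header
    cur.executemany("INSERT INTO my_table (id, my_field) VALUES (?, ?);", rows)

# Option 2: run a pg_dump export (reduced to plain INSERT statements) as a script
with open("dump.sql", "r", encoding="utf8") as f:
    cur.executescript(f.read())

conn.commit()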

If nobody has any experience with this, I'll make sure to post benchmarks as an answer later.

Edit: This question has gotten a downvote. From the comments it seems this is due to the lack of benchmarking. If not, please let me know how to improve this question. I definitely don't expect anybody to perform benchmarking for me. I'm simply hoping that someone has prior experience with bulk loading into SQLite.

André C. Andersen
    What are the results of your measurements? – CL. Jun 20 '17 at 15:41
  • @JLH I'm hoping someone with experience bulk loading into SQLite had an answer before I went to the trouble of double implementing the solution for testing purposes. If I don't hear from anybody I'll do the tests and post them here as an answer. I definitely don't want anybody to go to the trouble of doing benchmarks for me. I was only hoping that someone might have done this before. I've added the python 3 tag, thanks. – André C. Andersen Jun 20 '17 at 15:56
  • `copy` is as fast as it gets –  Jun 20 '17 at 18:00

1 Answer


It turns out there is no performant way to do this with pg_dump and INSERT statements. With both the CSV and the pg_dump strategies we end up reading the source file and inserting its rows ourselves. We're going with the CSV method, loading 10,000 rows per batch with executemany.

import sqlite3
import csv
from datetime import datetime
from itertools import islice


def split_iterable_by_size(iterable, batch_size):
    # Yield successive lists of at most batch_size rows from the iterable.
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        yield batch


conn = sqlite3.connect(":memory:")
cur = conn.cursor()

create_query = """
    CREATE VIRTUAL TABLE my_table USING fts4(
        id INTEGER,
        my_field TEXT
    );
"""
cur.execute(create_query)

# Allow very large text fields without csv.reader raising an error.
csv.field_size_limit(2147483647)

from_time = datetime.now()
with open('test.csv', 'r', encoding="utf8") as file:
    csv_file = csv.reader(file)
    header = next(csv_file)  # skip the header row
    query_template = """
        INSERT INTO my_table (id, my_field)
        VALUES (?, ?);
    """
    # Insert 10,000 rows per batch and commit after each batch.
    for batch in split_iterable_by_size(csv_file, 10000):
        cur.executemany(query_template, batch)
        conn.commit()
print(datetime.now() - from_time)

On our system and dataset this took 2 hours and 30 minutes. We won't be testing the alternative.
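For anyone who wants to benchmark the alternative anyway, here is a minimal, untested sketch of the pg_dump route. It assumes the dump was produced with pg_dump --data-only --inserts and trimmed of Postgres-specific statements (SET commands and the like) so that only INSERT statements remain; dump.sql is a placeholder name.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE VIRTUAL TABLE my_table USING fts4(
        id INTEGER,
        my_field TEXT
    );
""")

# Run all INSERT statements from the trimmed dump in a single call.
with open("dump.sql", "r", encoding="utf8") as f:
    conn.executescript(f.read())
conn.commit()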

André C. Andersen