
I have a large dataset I want to load into a SQLite in-memory database. I plan on loading the data from a file exported from Postgres. Which file format and load mechanism are the fastest?

Currently I'm considering the following two options:

1. Exporting the table to a CSV file and inserting the rows into SQLite from there.
2. Exporting INSERT statements with pg_dump and executing them against SQLite.

Which is faster? Is there a third, faster option, maybe?

This will be done as part of a Python 3 script. Does that affect the choice?
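For reference, this is roughly what I have in mind for each option (an untested sketch; the table layout and file names are placeholders, and in practice I'd only use one of the two):

import csv
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE my_table (id INTEGER, my_field TEXT);")

# Option 1: read a CSV export and insert the rows with executemany
with open("export.csv", "r", encoding="utf8") as f:
    rows = csv.reader(f)
    next(rows)  # skip the header
    cur.executemany("INSERT INTO my_table (id, my_field) VALUES (?, ?);", rows)

# Option 2: run a pg_dump export (reduced to plain INSERT statements) as a script
with open("dump.sql", "r", encoding="utf8") as f:
    cur.executescript(f.read())

conn.commit()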

If nobody has any experience with this, I'll make sure to post benchmarks as an answer later.

Edit: This question has gotten a downvote. From the comments it seems this is due to the lack of benchmarking. If not, please let me know how to improve this question. I definitely don't expect anybody to perform benchmarking for me. I'm simply hoping that someone has prior experience with bulk loading into SQLite.

André C. Andersen
    What are the results of your measurements? – CL. Jun 20 '17 at 15:41
  • @JLH I'm hoping someone with experience bulk loading into SQLite had an answer before I went to the trouble of double implementing the solution for testing purposes. If I don't hear from anybody I'll do the tests and post them here as an answer. I definitely don't want anybody to go to the trouble of doing benchmarks for me. I was only hoping that someone might have done this before. I've added the python 3 tag, thanks. – André C. Andersen Jun 20 '17 at 15:56
  • `copy` is as fast as it gets –  Jun 20 '17 at 18:00

1 Answer


It turns out there is no performant way to do this with pg_dump and INSERT statements. With both the CSV and the pg_dump strategies we end up reading the source file and inserting its rows ourselves. We're going with the CSV method, loading 10,000 rows per batch with executemany.

import sqlite3
import csv
from datetime import datetime
from itertools import islice


def split_iterable_by_size(iterable, batch_size):
    # Yield successive lists of at most batch_size rows from the iterable.
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        yield batch


conn = sqlite3.connect(":memory:")
cur = conn.cursor()

create_query = """
    CREATE VIRTUAL TABLE my_table USING fts4(
        id INTEGER,
        my_field TEXT
    );
"""
cur.execute(create_query)

# Allow very large text fields without csv.reader raising an error.
csv.field_size_limit(2147483647)

from_time = datetime.now()
with open('test.csv', 'r', encoding="utf8") as file:
    csv_file = csv.reader(file)
    header = next(csv_file)  # skip the header row
    query_template = """
        INSERT INTO my_table (id, my_field)
        VALUES (?, ?);
    """
    # Insert 10,000 rows per batch and commit after each batch.
    for batch in split_iterable_by_size(csv_file, 10000):
        cur.executemany(query_template, batch)
        conn.commit()
print(datetime.now() - from_time)

On our system and dataset this took 2 hours and 30 minutes. We won't be testing the alternative.
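For anyone who wants to benchmark the alternative anyway, here is a minimal, untested sketch of the pg_dump route. It assumes the dump was produced with pg_dump --data-only --inserts and trimmed of Postgres-specific statements (SET commands and the like) so that only INSERT statements remain; dump.sql is a placeholder name.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE VIRTUAL TABLE my_table USING fts4(
        id INTEGER,
        my_field TEXT
    );
""")

# Run all INSERT statements from the trimmed dump in a single call.
with open("dump.sql", "r", encoding="utf8") as f:
    conn.executescript(f.read())
conn.commit()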

André C. Andersen