How to speed up the import of csv to sql?

Question

This is CS50 Web Project1 stage where I have to import books.csv (containing isbn, title, author and year) with 5000 lines in it. The problem is that the import itself takes too long (about 10 lines per second) and it's not normal I suppose. How do I speed it up?

I have created a table with isbn, title, author and year rows all in varchar. I use postgesql. Next I wrote import.py , which looks like this

import csv
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker


engine = create_engine(os.getenv("DATABASE_URL"))
db = scoped_session(sessionmaker(bind=engine))

def main():
    f = open("books.csv")
    reader = csv.reader(f)
    for ISBN, title, author, year in reader:
        db.execute("INSERT INTO books (ISBN, title, author, year)      VALUES (:ISBN, :title, :author, :year)",
       {"ISBN":ISBN, "title":title, "author":author, "year":year})
    db.commit()

if __name__ == "__main__":
    main()

I expect the import to proceed in less than a minute, but now it's about 30-40 minutes.

Create a transaction and hence insert into the database as a bulk operation. Finally commit the transaction. — jens1o, Aug 15 '19 at 12:07
@jens1o could you please describe this solution in more details? I'm only beginning my path in handling databases) — Konstantin Golubtsov, Aug 15 '19 at 12:29
You should check the [docs](https://docs.sqlalchemy.org/en/13/orm/persistence_techniques.html#bulk-operations) and this [question](https://stackoverflow.com/questions/3659142/bulk-insert-with-sqlalchemy-orm) — ron_g, Aug 15 '19 at 13:51

score 0 · Answer 1 · answered Aug 16 '19 at 19:42

Given the performance you're observing (10 rows/ second), I would guess the latency of a request to your database is rather high (check it with ping). If that's the case, inserting multiple rows in a single query using INSERT .. VALUES (..), (..), (..) should help a lot.

To do that, you have to:

Pass a list of values to insert to execute: db.execute(sql_query, list_of_dicts_here)
Assuming you use psycopg2 to connect to Postgres, you need to tell sqlalchemy to use psycopg2's "fast execution helpers" by passing executemany_mode='values' to create_engine

If your file was larger, I would suggest to look into using "COPY FROM" bulk loads, but there's no point doing that with only 5000 rows.

How to speed up the import of csv to sql?

1 Answers1