
I've got 500K rows I want to insert into PostgreSQL using SQLAlchemy.

For speed, I'm inserting them using `session.bulk_insert_mappings()`.

Normally, I'd break up the insert into smaller batches to minimize session bookkeeping. However, `bulk_insert_mappings()` uses dicts and bypasses a lot of the traditional session bookkeeping.
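
Roughly, what I'm doing now looks like this (model and session names are placeholders):

```python
from myapp.models import MyModel   # placeholder ORM model
from myapp.db import Session       # placeholder sessionmaker

# 500K rows as plain dicts, keyed by column name
rows = [{"col1": i, "col2": "value %d" % i} for i in range(500000)]

session = Session()
session.bulk_insert_mappings(MyModel, rows)
session.commit()
```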

Will I still see a speed improvement if I break the insert up into smaller discrete batches, say doing an insert every 10K rows?

If so, should I close the PG transaction after every 10K rows, or leave it open the whole time?

Jeff Widman

1 Answer


In my experience, you'll see substantial performance improvements if you use `INSERT INTO tbl (column1, column2) VALUES (...), (...), ...;` as opposed to `bulk_insert_mappings()`, which uses `executemany`. In this case you'll want to batch the rows at least on a statement level for sanity.

SQLAlchemy supports generating a multi-row `VALUES` clause for a single `INSERT` statement, so you don't have to hand-craft the statement.
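
For example, passing a list of dictionaries to `.values()` on a Core `insert()` construct renders one multi-row statement (a rough sketch; the engine URL and table definition are placeholders for your own):

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

engine = create_engine("postgresql://user:pass@localhost/mydb")  # placeholder URL

metadata = MetaData()
# Placeholder definition of the target table.
tbl = Table(
    "tbl", metadata,
    Column("column1", Integer),
    Column("column2", String),
)

rows = [{"column1": 1, "column2": "a"}, {"column1": 2, "column2": "b"}]

# A list of dicts in .values() renders a single statement along the lines of
# INSERT INTO tbl (column1, column2) VALUES (...), (...)
stmt = tbl.insert().values(rows)

with engine.begin() as conn:  # one transaction, committed on exit
    conn.execute(stmt)
```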

Committing between batches probably won't have much of an effect on performance, but the reason to do it would be to avoid keeping a transaction open for too long, which could impact other transactions running on the server.
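
Continuing the sketch above, committing per batch might look something like this (the 10K batch size is just the figure from the question):

```python
BATCH_SIZE = 10000

for start in range(0, len(rows), BATCH_SIZE):
    batch = rows[start:start + BATCH_SIZE]
    # Each engine.begin() block is its own short-lived transaction,
    # so nothing stays open across the full 500K rows.
    with engine.begin() as conn:
        conn.execute(tbl.insert().values(batch))
```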

You can also experiment with using `COPY` to load it into a temporary table, then `INSERT`ing from that table.
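
A rough sketch of that approach, dropping down to the raw psycopg2 connection for `COPY` (the temp table and column names are placeholders):

```python
import io

# Serialize rows into the tab-separated text format COPY expects.
# Real data with tabs/newlines/backslashes needs proper escaping, and for
# 500K rows you may prefer to stream from a file instead of building the
# whole buffer in memory.
buf = io.StringIO()
for row in rows:
    buf.write("%s\t%s\n" % (row["column1"], row["column2"]))
buf.seek(0)

raw = engine.raw_connection()  # the underlying psycopg2 connection
try:
    cur = raw.cursor()
    cur.execute("CREATE TEMP TABLE tmp_tbl (LIKE tbl INCLUDING DEFAULTS)")
    cur.copy_expert("COPY tmp_tbl (column1, column2) FROM STDIN", buf)
    cur.execute(
        "INSERT INTO tbl (column1, column2) "
        "SELECT column1, column2 FROM tmp_tbl"
    )
    raw.commit()
finally:
    raw.close()
```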

univerio
  • My understanding from reading the docs is `executemany` emits the same SQL as `INSERT INTO tbl VALUES (...), (...), ...;`... is that incorrect? – Jeff Widman Apr 27 '16 at 01:22
  • @JeffWidman Right, that's incorrect. For psycopg2 at least, `.executemany` is just a glorified version of `.execute` in a loop. See [this question](http://stackoverflow.com/a/10147451/341730). – univerio Apr 27 '16 at 01:24
  • Oh wow, totally didn't realize that. Leaking abstractions showing up again. Thx. – Jeff Widman Apr 27 '16 at 01:27