
I'm trying to truncate a table and insert only ~3000 rows of data using SQLAlchemy, and it's very slow (~10 minutes).

I followed the recommendations in this doc and leveraged SQLAlchemy Core to do my inserts, but it's still running very slowly. What are possible culprits for me to look at? The database is a Postgres RDS instance. Thanks!

import sqlalchemy as sa

engine = sa.create_engine(db_string, **kwargs, pool_recycle=3600)
with engine.begin() as conn:
    conn.execute("TRUNCATE my_table")
    conn.execute(
        MyTable.__table__.insert(),
        data,  # where data is a list of dicts
    )
ResrieC
  • Do you have a primary key? Are there a lot of cross-referenced tables? 10 minutes to insert 3k rows seems very excessive... how big is a row? Is your database running remotely (i.e. not localhost)? That would definitely slow things down; how much probably depends on network latency. – Joran Beasley Sep 12 '16 at 17:19
  • Yes, I have a primary key, which I set through sa/alembic as a natural-key String of size 100. However, I also tried switching it out for a surrogate integer key and had similar performance issues. My row is 6 columns of strings. The DB is remote (in AWS), so the network may have something to do with it, then. – ResrieC Sep 12 '16 at 17:53
  • This uses the `executemany` feature, which, for psycopg2, is a glorified `execute` in a `for` loop. See [this question](http://stackoverflow.com/questions/8134602/psycopg2-insert-multiple-rows-with-one-query). – univerio Sep 12 '16 at 22:52
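
To make that last comment concrete, here is a minimal sketch of sending one multi-row statement through psycopg2's `execute_values` helper instead of relying on `executemany`. It assumes psycopg2 is the driver, reuses the `engine` and `data` objects from the question, and uses placeholder column names:

from psycopg2.extras import execute_values

# Sketch only: execute_values batches the rows into multi-row
# "VALUES (...), (...)" chunks instead of one round trip per row.
raw_conn = engine.raw_connection()  # raw DBAPI (psycopg2) connection from SQLAlchemy
try:
    with raw_conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO my_table (column1, column2) VALUES %s",  # placeholder columns
            [(row["column1"], row["column2"]) for row in data],
            page_size=1000,
        )
    raw_conn.commit()
finally:
    raw_conn.close()

With this, the number of round trips drops from one per row to roughly one per `page_size` rows.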

2 Answers


I was bummed when I saw this didn't have an answer... I ran into the exact same problem the other day: trying to bulk-insert millions of rows into a Postgres RDS instance using Core. It was taking hours.

As a workaround, I ended up writing my own bulk-insert script that generated the raw SQL itself:

bulk_insert_str = []
for entry in entry_list:
    val_str = "('{}', '{}', ...)".format(entry["column1"], entry["column2"], ...)
    bulk_insert_str.append(val_str)

engine.execute(
    """
    INSERT INTO my_table (column1, column2 ...)
    VALUES {}
    """.format(",".join(bulk_insert_str))
)

While ugly, this gave us the performance we needed (~500,000 rows/minute).
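
If interpolating the values straight into the SQL feels risky, the same single-statement shape can be kept with bound parameters via SQLAlchemy's multi-valued `insert()`; a minimal sketch, reusing the `MyTable` model and `data` list of dicts from the question:

# Sketch: a list passed to .values() renders ONE
# "INSERT INTO my_table (...) VALUES (...), (...), ..." statement with
# bound parameters, unlike conn.execute(MyTable.__table__.insert(), data),
# which goes through the driver's executemany path.
with engine.begin() as conn:
    conn.execute(MyTable.__table__.insert().values(data))

A few thousand rows fit comfortably in one such statement; for much larger batches the list would probably need to be chunked to keep statement size reasonable.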

Did you find a CORE-based solution? If not, hope this helps!

UPDATE: I ended up moving my old script onto a spare EC2 instance that we weren't using, which actually fixed the slow performance. Not sure what your setup is, but apparently there's network overhead in communicating with RDS from an external (non-AWS) connection.

dizzyf
  • Oh, hmm. I was trying to be safe with my data and use `sqlalchemy.text()` around my `insert`, and then parameterize the values (e.g. `engine.execute(sqlalchemy.text(insert_str), **parameters)`) but it seems like `sqlalchemy.text()` is taking a ton of time... Maybe throwing caution to the wind and just slapping values in there is the way to go? *cringes* (To be clear: not criticizing; it may actually be my best solution. Just scares me a little) – dwanderson Mar 08 '17 at 21:55
  • So the actual problem was the network latency? Or was there an issue with how sqlalchemy was doing the insert? – Chris May 14 '18 at 20:50
  • @Chris To the best of my knowledge it's a network latency issue, which is compounded by the additional overhead of some SQLAlchemy behavior. – dizzyf Aug 16 '18 at 15:58

Some time ago I was struggling with the same problem at my company, so we created a library with functions for bulk insert and update. I hope we've taken all the performance and security concerns into account. The library is open-source and available on PyPI; its name is bulky.

Let me show you some examples of usage:

insert:

import bulky
from random import random

from your.sqlalchemy.models import Model
from your.sqlalchemy.session import Session

data = [
    {Model.column_float: random()}
    for _ in range(100_000_000)
]

rows_inserted = bulky.insert(
    session=Session,
    table_or_model=Model,
    values_series=data,
    returning=[Model.id, Model.column_float]
)

new_items = {row.id: row.column_float for row in rows_inserted}

update:

import bulky
from your.sqlalchemy.models import ManyToManyTable
from your.sqlalchemy.session import Session

data = [
    {
        ManyToManyTable.fk1: i,
        ManyToManyTable.fk2: j,
        ManyToManyTable.value: i + j,
    }
    for i in range(100_000_000)
    for j in range(100_000_000)
]

rows_updated = bulky.update(
    session=Session,
    table_or_model=ManyToManyTable,
    values_series=data,
    returning=[
        ManyToManyTable.fk1,
        ManyToManyTable.fk2,
        ManyToManyTable.value,
    ],
    reference=[
        ManyToManyTable.fk1,
        ManyToManyTable.fk2,
    ],
)

updated_items = {(row.fk1, row.fk2): row.value for row in rows_updated}

Not sure if links are allowed, so I'll put them under a spoiler:

Readme and PyPI

Alex Sidorov