208

Is there any way to get SQLAlchemy to do a bulk insert rather than inserting each individual object? I.e., doing:

INSERT INTO `foo` (`bar`) VALUES (1), (2), (3)

rather than:

INSERT INTO `foo` (`bar`) VALUES (1)
INSERT INTO `foo` (`bar`) VALUES (2)
INSERT INTO `foo` (`bar`) VALUES (3)

I've just converted some code to use SQLAlchemy rather than raw SQL and, although it is now much nicer to work with, it seems to be slower (up to a factor of 10). I'm wondering if this is the reason.

Maybe I could improve the situation by using sessions more efficiently. At the moment I have autoCommit=False and do a session.commit() after I've added some stuff. However, this seems to cause the data to go stale if the DB is changed elsewhere: even if I do a new query, I still get the old results back.

Thanks for your help!

Nick Holden
  • This might help: http://stackoverflow.com/questions/270879/efficiently-updating-database-using-sqlalchemy-orm/278606#278606 – Sean Vieira Sep 07 '10 at 16:45
  • Nick, I understand this is a *very* old post. Would it be possible to update the title to something *correct* like "multiple record insert with SQLAlchemy ORM". Multi-record insert statements like the one you've provided are quite different from bulk-loading operations at the database level. Bulk inserts are intended for 1k+ data uploads, usually from large datasets and done by application managers, not REST operations or application level code.... Let's use our nomenclature properly. – W4t3randWind Nov 09 '17 at 17:43
  • For those who stumble upon this question while looking for information about bulk operations in sqlalchemy *Core* (not ORM), see [my answer to another question](https://stackoverflow.com/a/57529830/1026). – Nickolay Aug 16 '19 at 19:02

11 Answers

275

SQLAlchemy introduced that in version 1.0.0:

Bulk operations - SQLAlchemy docs

With these operations, you can now do bulk inserts or updates!

For instance, you can do:

s = Session()
objects = [
    User(name="u1"),
    User(name="u2"),
    User(name="u3")
]
s.bulk_save_objects(objects)
s.commit()

Here, a bulk insert will be made.
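
The same release also added bulk update helpers. As a minimal sketch (assuming the same User model as above; the id values below are made up for illustration), a bulk UPDATE via Session.bulk_update_mappings() looks like this:

s = Session()
# each mapping must include the primary key of the row to update
s.bulk_update_mappings(User, [
    {"id": 1, "name": "u1-renamed"},
    {"id": 2, "name": "u2-renamed"},
    {"id": 3, "name": "u3-renamed"},
])
s.commit()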

Pierre
  • You also need s.commit() to actually save the records (it took me a bit to figure this one out). – horcle_buzz Dec 04 '15 at 01:30
  • I tried this with sqlalchemy 1.0.11 and it still makes 3 insert statements. But it is a lot faster than normal orm operations. – zidarsk8 Oct 07 '16 at 21:12
  • While not pertinent to the OP's question, it is worth mentioning this does break certain features of the ORM. http://docs.sqlalchemy.org/en/rel_1_0/orm/persistence_techniques.html#orm-compatibility – dangel Jun 28 '17 at 03:05
  • @dangel yes, thank you for posting this. Although the OP's title concerns "bulk loading", his question about multi-record insert statements has nothing to do with sqlalchemy's bulk loading feature. – W4t3randWind Nov 09 '17 at 17:46
  • Compared to inserting the same data from CSV with `\copy` with psql (from the same client to the same server), I see a huge difference in performance *on the server side*, resulting in about 10x more inserts/s. Apparently bulk loading using `\copy` (or `COPY` on the server), which packs the client-to-server communication, is a LOT better than using SQL via SQLAlchemy. More info: [Large bulk insert performance difference PostgreSQL vs ...](https://dba.stackexchange.com/a/242338/13155). – gertvdijk Jul 12 '19 at 16:42
  • Do we actually need the s.commit() here to add to the db? The official sqlalchemy docs here https://docs.sqlalchemy.org/en/13/orm/session_api.html?highlight=bulk_save_object#sqlalchemy.orm.session.Session.bulk_save_objects don't explicitly suggest committing to the db. – Soma Siddhartha Mar 16 '22 at 12:45
81

The sqlalchemy docs have a writeup on the performance of various techniques that can be used for bulk inserts:

ORMs are basically not intended for high-performance bulk inserts - this is the whole reason SQLAlchemy offers the Core in addition to the ORM as a first-class component.

For the use case of fast bulk inserts, the SQL generation and execution system that the ORM builds on top of is part of the Core. Using this system directly, we can produce an INSERT that is competitive with using the raw database API directly.

Alternatively, the SQLAlchemy ORM offers the Bulk Operations suite of methods, which provide hooks into subsections of the unit of work process in order to emit Core-level INSERT and UPDATE constructs with a small degree of ORM-based automation.

The example below illustrates time-based tests for several different methods of inserting rows, going from the most automated to the least. With cPython 2.7, runtimes observed:

classics-MacBook-Pro:sqlalchemy classic$ python test.py
SQLAlchemy ORM: Total time for 100000 records 12.0471920967 secs
SQLAlchemy ORM pk given: Total time for 100000 records 7.06283402443 secs
SQLAlchemy ORM bulk_save_objects(): Total time for 100000 records 0.856323003769 secs
SQLAlchemy Core: Total time for 100000 records 0.485800027847 secs
sqlite3: Total time for 100000 records 0.487842082977 sec

Script:

import time
import sqlite3

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String,  create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

Base = declarative_base()
DBSession = scoped_session(sessionmaker())
engine = None


class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))


def init_sqlalchemy(dbname='sqlite:///sqlalchemy.db'):
    global engine
    engine = create_engine(dbname, echo=False)
    DBSession.remove()
    DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)


def test_sqlalchemy_orm(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in xrange(n):
        customer = Customer()
        customer.name = 'NAME ' + str(i)
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print(
        "SQLAlchemy ORM: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def test_sqlalchemy_orm_pk_given(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in xrange(n):
        customer = Customer(id=i+1, name="NAME " + str(i))
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print(
        "SQLAlchemy ORM pk given: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def test_sqlalchemy_orm_bulk_insert(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    n1 = n
    while n1 > 0:
        # insert in chunks of at most 10,000 mappings so that all n rows are loaded
        chunk = min(10000, n1)
        DBSession.bulk_insert_mappings(
            Customer,
            [
                dict(name="NAME " + str(i))
                for i in xrange(chunk)
            ]
        )
        n1 = n1 - chunk
    DBSession.commit()
    print(
        "SQLAlchemy ORM bulk_save_objects(): Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def test_sqlalchemy_core(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    engine.execute(
        Customer.__table__.insert(),
        [{"name": 'NAME ' + str(i)} for i in xrange(n)]
    )
    print(
        "SQLAlchemy Core: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def init_sqlite3(dbname):
    conn = sqlite3.connect(dbname)
    c = conn.cursor()
    c.execute("DROP TABLE IF EXISTS customer")
    c.execute(
        "CREATE TABLE customer (id INTEGER NOT NULL, "
        "name VARCHAR(255), PRIMARY KEY(id))")
    conn.commit()
    return conn


def test_sqlite3(n=100000, dbname='sqlite3.db'):
    conn = init_sqlite3(dbname)
    c = conn.cursor()
    t0 = time.time()
    for i in xrange(n):
        row = ('NAME ' + str(i),)
        c.execute("INSERT INTO customer (name) VALUES (?)", row)
    conn.commit()
    print(
        "sqlite3: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " sec")

if __name__ == '__main__':
    test_sqlalchemy_orm(100000)
    test_sqlalchemy_orm_pk_given(100000)
    test_sqlalchemy_orm_bulk_insert(100000)
    test_sqlalchemy_core(100000)
    test_sqlite3(100000)
Grant Humphries
  • I saw another example using bindparams. The syntax looks succinct, is that any good? – Jay Feb 19 '20 at 14:24
35

As far as I know, there is no way to get the ORM to issue bulk inserts. I believe the underlying reason is that SQLAlchemy needs to keep track of each object's identity (i.e., new primary keys), and bulk inserts interfere with that. For example, assuming your foo table contains an id column and is mapped to a Foo class:

x = Foo(bar=1)
print x.id
# None
session.add(x)
session.flush()
# BEGIN
# INSERT INTO foo (bar) VALUES(1)
# COMMIT
print x.id
# 1

Since SQLAlchemy picked up the value for x.id without issuing another query, we can infer that it got the value directly from the INSERT statement. If you don't need subsequent access to the created objects via the same instances, you can skip the ORM layer for your insert:

Foo.__table__.insert().execute([{'bar': 1}, {'bar': 2}, {'bar': 3}])
# INSERT INTO foo (bar) VALUES ((1,), (2,), (3,))

SQLAlchemy can't match these new rows with any existing objects, so you'll have to query them anew for any subsequent operations.

As far as stale data is concerned, it's helpful to remember that the session has no built-in way to know when the database is changed outside of the session. In order to access externally modified data through existing instances, the instances must be marked as expired. This happens by default on session.commit(), but can be done manually by calling session.expire_all() or session.expire(instance). An example (SQL omitted):

x = Foo(bar=1)
session.add(x)
session.commit()
print x.bar
# 1
foo.update().execute(bar=42)
print x.bar
# 1
session.expire(x)
print x.bar
# 42

session.commit() expires x, so the first print statement implicitly opens a new transaction and re-queries x's attributes. If you comment out the first print statement, you'll notice that the second one now picks up the correct value, because the new query isn't emitted until after the update.

This makes sense from the point of view of transactional isolation - you should only pick up external modifications between transactions. If this is causing you trouble, I'd suggest clarifying or re-thinking your application's transaction boundaries instead of immediately reaching for session.expire_all().

dhaffey
  • Thanks for your reply, I'm going to give that a go. WRT the expiring issue, what I saw wasn't quite the same. I am using a scoped session in turbogears. Performing a getSession().query(Foo).filter....all() returned different things depending on the request, also did not return the updated records that were in the db until I restarted it. I fixed this issue by doing a autocommit=True and adding in something that .remove()d the session after the request was complete (I gather you are meant to do that anyway). – Nick Holden Sep 08 '10 at 08:22
  • I guess it returned different things depending on the request because it had a scoped session per thread in the pool and the sessions were in different states? It seemed a bit odd that sa wouldn't get new data after a fresh request though. I expect I am misunderstanding what autocommit=False is doing – Nick Holden Sep 08 '10 at 08:33
  • With `autocommit=False`, I believe you should be calling `session.commit()` upon request completion (I'm not familiar with TurboGears, so ignore this if that's handled for you at the framework level). Besides making sure your changes have made it to the database, this would expire everything in the session. The next transaction wouldn't begin until the next use of that session, so future requests on the same thread wouldn't see stale data. – dhaffey Sep 08 '10 at 17:22
  • Alternative style: `session.execute(Foo.__table__.insert(), values)` – Joril Feb 03 '13 at 17:11
  • Note that newer versions of sqlalchemy have bulk insert capabilities: http://docs.sqlalchemy.org/en/latest/orm/persistence_techniques.html#bulk-operations – Wayne Werner Feb 03 '16 at 16:52
29

I usually do it using add_all.

from app import session
from models import User

objects = [User(name="u1"), User(name="u2"), User(name="u3")]
session.add_all(objects)
session.commit()
reubano
  • Are you sure this works? It doesn't just do the equivalent of `.add`ing them to the session one at a time? – alecbz Apr 18 '18 at 00:59
  • That would be counterintuitive given the method name; the docs don't go into detail: `Add the given collection of instances to this Session.` Do you have any reason to believe it doesn't do a bulk insert? – reubano Apr 18 '18 at 09:20
  • I don't think it's too counterintuitive -- it does in fact _add_ _all_ of the things you ask it to. Nothing about adding all the things to the session seems like it would imply what underlying SQL statements get issued. Looking at the source: https://github.com/zzzeek/sqlalchemy/blob/ea36338b2e5f621287e9890deeffb6c5f6ff0874/lib/sqlalchemy/orm/session.py#L1784 it does in fact seem to just `.add` each item individually. – alecbz Apr 18 '18 at 16:46
  • It works well compared to `bulk_save_objects()`: with a `flush()` we can get the object's ID, but `bulk_save_objects()` can't (even with `flush()` called). – coanor Sep 19 '18 at 06:36
  • @reubano `add_all` issues individual `INSERT INTO`. This is the _opposite_ of what the question is asking for. You don't have to believe me, just check the query log. – uKolka Jun 21 '23 at 16:52
24

Direct support was added to SQLAlchemy as of version 0.8.

As per the docs, connection.execute(table.insert().values(data)) should do the trick. (Note that this is not the same as connection.execute(table.insert(), data), which results in many individual row inserts via a call to executemany.) On anything but a local connection, the difference in performance can be enormous.
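
A minimal sketch of the difference (the engine URL and the foo table below are made up for illustration; only the two connection.execute() calls matter):

from sqlalchemy import create_engine, MetaData, Table, Column, Integer

engine = create_engine("sqlite:///bulk_demo.db", echo=True)  # hypothetical URL
metadata = MetaData()
foo = Table("foo", metadata,
            Column("id", Integer, primary_key=True),
            Column("bar", Integer))
metadata.create_all(engine)

data = [{"bar": 1}, {"bar": 2}, {"bar": 3}]

with engine.begin() as connection:
    # one multi-row statement: INSERT INTO foo (bar) VALUES (?), (?), (?)
    connection.execute(foo.insert().values(data))
    # by contrast, this goes through executemany (one parameter set per row):
    # connection.execute(foo.insert(), data)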

user3805082
  • This was a lifesaver because I've been needing to properly create a single bulk UPDATE for use with the sqlalchemy-bigquery dialect and nothing worked until I tried this. Thanks – CaffeinatedMike Aug 09 '22 at 00:28
  • My experience was the opposite. Inserting 215k rows with 132 columns, session.execute(cls.__table__.insert(), values) finished in a couple minutes while session.execute(cls.__table__.insert().values(values)) hogged the CPU and crashed. – James Bridgewater Feb 22 '23 at 05:21
20

SQLAlchemy introduced that in version 1.0.0:

Bulk operations - SQLAlchemy docs

With these operations, you can now do bulk inserts or updates!

For instance (if you want the lowest overhead for simple table INSERTs), you can use Session.bulk_insert_mappings():

loadme = [(1, 'a'),
          (2, 'b'),
          (3, 'c')]
dicts = [dict(bar=t[0], fly=t[1]) for t in loadme]

s = Session()
s.bulk_insert_mappings(Foo, dicts)
s.commit()

Or, if you want, skip the loadme tuples and write the dictionaries directly into dicts (but I find it easier to leave all the wordiness out of the data and load up a list of dictionaries in a loop).
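
For example, the same call with the dictionaries written out inline (a sketch; Foo with bar and fly columns is the hypothetical model assumed above):

s = Session()
s.bulk_insert_mappings(Foo, [
    dict(bar=1, fly='a'),
    dict(bar=2, fly='b'),
    dict(bar=3, fly='c'),
])
s.commit()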

juanitogan
  • Dear @juanitogan, would you be so kind as to show an example of how you created this Foo object (or would it be a Class)? Would there be a case scenario in which one does not have to create the whole Foo class? Instead, just download it directly from an existing Table from a remote Database? If so, could you elaborate an example of its usage? – Philipe Riskalla Leal Mar 18 '21 at 19:05
  • @Philipe Riskalla Leal Foo here is the table class, i.e. class Foo: bar = Column(Integer) fly = Column(String(200)) – sunny Jan 05 '22 at 03:31
9

Pierre's answer is correct, but one issue is that bulk_save_objects by default does not return the primary keys of the objects, if that is of concern to you. Set return_defaults to True to get this behavior.

The documentation is here.

foos = [Foo(bar='a',), Foo(bar='b'), Foo(bar='c')]
session.bulk_save_objects(foos, return_defaults=True)
for foo in foos:
    assert foo.id is not None
session.commit()
Matthew Moisen
  • Caution must be taken with this flag. It will insert one object at a time sequentially and the significant performance gain may not be there [1]. In my case, the performance degraded, which I suspect was due to the overhead. [1]: https://docs.sqlalchemy.org/en/13/orm/session_api.html#sqlalchemy.orm.session.Session.bulk_save_objects.params.return_defaults – dhfromkorea Jul 23 '19 at 20:28
6

This is a way:

values = [1, 2, 3]
Foo.__table__.insert().execute([{'bar': x} for x in values])

This will insert like this:

INSERT INTO `foo` (`bar`) VALUES (1), (2), (3)

Reference: The SQLAlchemy FAQ includes benchmarks for various commit methods.

Eefret
6

All roads lead to Rome, but some of them cross mountains or require ferries. If you want to get there quickly, just take the motorway.

In this case the motorway is to use the execute_batch() feature of psycopg2. The documentation says it best:

The current implementation of executemany() is (using an extremely charitable understatement) not particularly performing. These functions can be used to speed up the repeated execution of a statement against a set of parameters. By reducing the number of server roundtrips the performance can be orders of magnitude better than using executemany().

In my own test execute_batch() is approximately twice as fast as executemany(), and gives the option to configure the page_size for further tweaking (if you want to squeeze the last 2-3% of performance out of the driver).

The same feature can easily be enabled if you are using SQLAlchemy by setting use_batch_mode=True as a parameter when you instantiate the engine with create_engine().
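
A minimal sketch of both routes (the DSN, connection URL, and foo table are placeholders; execute_batch lives in psycopg2.extras):

import psycopg2
from psycopg2 import extras

conn = psycopg2.connect("dbname=test user=postgres")  # placeholder DSN
with conn:
    with conn.cursor() as cur:
        extras.execute_batch(
            cur,
            "INSERT INTO foo (bar) VALUES (%s)",
            [(1,), (2,), (3,)],
            page_size=100,  # number of statements packed per server roundtrip
        )

# ...or let SQLAlchemy enable the same batching (flag from the 1.2/1.3 era):
from sqlalchemy import create_engine
engine = create_engine(
    "postgresql+psycopg2://user:password@localhost/test",  # placeholder URL
    use_batch_mode=True,
)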

chjortlund
  • Note: psycopg2's `execute_values` is **faster** than psycopg2's `execute_batch` when doing bulk inserts! – Fierr Apr 05 '20 at 14:37
4

The best answer I found so far was in the sqlalchemy documentation:

http://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow

There is a complete example of a benchmark of possible solutions.

As shown in the documentation, bulk_save_objects is not the best solution, but its performance is decent.

The second-best implementation in terms of readability, I think, was with the SQLAlchemy Core:

def test_sqlalchemy_core(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    engine.execute(
        Customer.__table__.insert(),
            [{"name": 'NAME ' + str(i)} for i in xrange(n)]
    )

The context of this function is given in the documentation article.

lelabo_m
0

SQLAlchemy supports bulk inserts via bulk_save_objects:

bulk_list = [
    Foo(
        bar=1,
    ),
    Foo(
        bar=2,
    ),
    Foo(
        bar=3,
    ),
]
db.session.bulk_save_objects(bulk_list)
db.session.commit()
Hashir Irfan