
I'm going through this article on SQLAlchemy bulk insert performance. I tried various approaches specified in the benchmark test - SQLAlchemy ORM bulk_insert_mappings(), SQLAlchemy Core. Unfortunately, inserting 1000 rows with any of these methods took about 1 minute. This is horrendously slow. I also tried the approach specified here, which requires building a large SQL statement like:

INSERT INTO mytable (col1, col2, col3)
VALUES (1,2,3), (4,5,6) ..... --- up to 1000 of these

And the insert for this raw SQL is something like:

MySession.execute('''
insert into MyTable (e, l, a)
values {}
'''.format(",".join(my_insert_str)))
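
For reference, my_insert_str is a list of per-row value strings; a rough sketch of how it's built (the values are formatted straight into the SQL text, which is exactly why this isn't injection-safe):

# Rough sketch: each row's values are formatted directly into the SQL text.
my_insert_str = [
    "('{}', '{}', '{}')".format(row["e"], row["l"], row["a"])
    for _, row in chunk.iterrows()
]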

Using this approach I improved the performance by more than 50x, to 10,000 insertions in 10-11 seconds.

Here is the code for the approach using the built-in lib.

from sqlalchemy import Column, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class MyClass(Base):
    __tablename__ = "MyTable"
    e = Column(String(256), primary_key=True)
    l = Column(String(6))
    a = Column(String(20), primary_key=True)

    def __repr__(self):
        return self.e + " " + self.a + " " + self.l

.......

        dict_list = []
        for i, row in chunk.iterrows():
            dict_list.append({"e": row["e"], "l": row["l"], "a": row["a"]})

        # Bulk insert via the Core table construct (this is the slow path)
        MySession.execute(
            MyClass.__table__.insert(),
            dict_list
        )

Here is how I connect to the database.

    params = urllib.quote_plus("DRIVER={SQL Server Native Client 10.0};SERVER=servername;DATABASE=dbname;UID=user;PWD=pass")
    engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
    MySession.configure(bind=engine, autoflush=False, expire_on_commit=False)
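
MySession is a scoped_session (it is both configure()d and used for execute() directly), created along these lines:

from sqlalchemy.orm import scoped_session, sessionmaker

# Assumption: MySession is a scoped_session, which is why it supports both
# .configure(bind=...) and .execute(...) directly.
MySession = scoped_session(sessionmaker())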

Is there an issue with my setup that degrades the performance so much? I tried different DB drivers - pyodbc and pymssql. Whatever I try, I cannot get anywhere close to the numbers they claim in the article, namely:

SQLAlchemy ORM: Total time for 100000 records 2.192882061 secs
SQLAlchemy ORM pk given: Total time for 100000 records 1.41679310799 secs
SQLAlchemy ORM bulk_save_objects(): Total time for 100000 records 0.494568824768 secs
SQLAlchemy ORM bulk_insert_mappings(): Total time for 100000 records 0.325763940811 secs
SQLAlchemy Core: Total time for 100000 records 0.239127874374 secs
sqlite3: Total time for 100000 records 0.124729156494 sec

I'm connecting to MS SQL Server 2008. Let me know if I've missed any other details.

The problem with the raw SQL approach is that it's not SQL-injection safe. So alternatively, if you have suggestions on how to solve this issue, that would also be very helpful :).

Anton Belev
  • I would suggest you read http://seldo.com/weblog/2011/08/11/orm_is_an_antipattern, just mentioning it ;) – Cédric Julien Aug 03 '17 at 12:30
  • I'd much rather suggest reading https://github.com/mkleehammer/pyodbc/issues/120, a long-standing issue about pyodbc and extremely slow `executemany()` performance for large inserts. ORMs have their place, though bulk data handling is not one of them. – Ilja Everilä Aug 03 '17 at 19:18

1 Answer


You're doing

MySession.execute(
    MyClass.__table__.insert(),
    dict_list
)

which uses executemany(): the driver executes the same single-row INSERT once per parameter set. It is not the same as a multi-row INSERT INTO ... VALUES ... statement. To use VALUES, do:

MySession.execute(
    MyClass.__table__.insert().values(dict_list)
)

As a side note, the SQL injection problem is solved using parameters:

MySession.execute('''
insert into MyTable (e, l, a)
values (?, ?, ?), (?, ?, ?), ...
''', params)
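
As a rough sketch of generating those placeholders and the flat parameter list (the rows here are made up, and it goes through the raw pyodbc cursor from your engine, since "?" is the DBAPI's paramstyle):

rows = [("e1", "l1", "a1"), ("e2", "l2", "a2")]  # made-up data

# Build "(?, ?, ?), (?, ?, ?), ..." and flatten the parameters to match.
# Note: SQL Server caps a statement at 2100 parameters and a VALUES list at
# 1000 rows, so very large batches need to be chunked.
placeholders = ", ".join("(?, ?, ?)" for _ in rows)
flat_params = [value for row in rows for value in row]

raw_conn = engine.raw_connection()
try:
    cursor = raw_conn.cursor()
    cursor.execute(
        "INSERT INTO MyTable (e, l, a) VALUES " + placeholders,
        flat_params,
    )
    raw_conn.commit()
finally:
    raw_conn.close()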

The takeaway here is that you're not comparing equivalent constructs: you're not using VALUES in the SQLAlchemy-generated query but you are in your textual SQL, and you're not using parameterization in your textual SQL but you are in the SQLAlchemy-generated query. If you turn on logging for the executed SQL statements, you'll see exactly what is different.
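
For example, either of the following will log every statement together with its parameters (a minimal sketch reusing the connection params from your question):

import logging

from sqlalchemy import create_engine

# Option 1: configure the sqlalchemy.engine logger.
logging.basicConfig()
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)

# Option 2: pass echo=True; the engine then logs each statement and a repr of
# its parameters to stdout.
engine = create_engine(
    "mssql+pyodbc:///?odbc_connect=%s" % params,
    echo=True,
)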

univerio
    Not saying it applies to all cases, esp. here, but the compiling of an `insert().values()` [proved to take even longer in one case](https://stackoverflow.com/questions/43317376/how-can-i-use-psycopg2-extras-in-sqlalchemy#comment73706472_43317376). The execution itself was faster as expected. Should look more into that. – Ilja Everilä Aug 03 '17 at 19:07
    @IljaEverilä That's a good point. For large datasets like that, I tend to stream a CSV into `COPY` instead, like your answer, just without generating the entire file in memory. For ~1k entries though, the overhead of using `VALUES` should be relatively small. – univerio Aug 03 '17 at 19:14
  • Agreed. It seems pyodbc has [issues with `executemany()` and several thousand rows](https://github.com/mkleehammer/pyodbc/issues/120). – Ilja Everilä Aug 03 '17 at 19:20
  • @univerio thanks for the answer. I actually ended up using parameters to prevent SQL injection, and so far I think this gives me the best performance. Good to see that you also suggested this! I will check the .insert().values(dict_list) approach again, although I'm a bit sceptical about any built-in approaches now :D – Anton Belev Aug 04 '17 at 08:53
  • It is still strange how the performance article obtained <1 sec times for 100,000 insertions, while I cannot get anywhere near 10 secs for 10,000. – Anton Belev Aug 04 '17 at 08:54
  • @AntonBelev Possibly the sqlite3 driver used is more performant than pyodbc, which seems to have an implementation issue that slows down 10k+ inserts to a crawl (if using executemany). – Ilja Everilä Aug 04 '17 at 09:55
  • @IljaEverilä I think you remember me from a similar question; we discussed this problem in chat as well, when I was trying to bulk upsert a very large number of records. I am using this same `insert().values(dict_list)` approach, but my insertion is still extremely slow compared to raw SQL execution. Is there a way to fix that? – saadi Jan 28 '20 at 10:09