
I have two tables, Foo and Bar. I just added a new column x to the Bar table, which has to be populated using values from Foo:

from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Foo(Base):
    __tablename__ = 'foo'
    id = Column(Integer, primary_key=True)
    x = Column(Integer, nullable=False)

class Bar(Base):
    __tablename__ = 'bar'
    id = Column(Integer, primary_key=True)
    x = Column(Integer, nullable=False)
    foo_id = Column(Integer, ForeignKey('foo.id'), nullable=False)

One straightforward way to do it would be to iterate over all the rows in Bar and update them one by one, but that takes a long time (there are more than 100k rows in both Foo and Bar):

for b, foo_x in session.query(Bar, Foo.x).join(Foo, Foo.id==Bar.foo_id):
    b.x = foo_x
session.flush()

Now I was wondering if this would be the right way to do it:

mappings = []
for b, foo_x in session.query(Bar, Foo.x).join(Foo, Foo.id==Bar.foo_id):
    info = {'id':b.id, 'x': foo_x}
    mappings.append(info)
session.bulk_update_mappings(Bar, mappings)

There aren't many examples of bulk_update_mappings out there. The docs say:

All those keys which are present and are not part of the primary key are applied to the SET clause of the UPDATE statement; the primary key values, which are required, are applied to the WHERE clause.

So, in this case, id will be used in the WHERE clause and each row would then be updated with the x value in its dictionary, right?
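
If I read that correctly, a call like the one above boils down to a single parameterized UPDATE executed via executemany, roughly like this (just my understanding of the emitted SQL, with made-up values, not something I've verified):

# Each dict in `mappings` supplies one parameter set for a statement
# shaped roughly like:
#
#   UPDATE bar SET x = :x WHERE bar.id = :id
#
session.bulk_update_mappings(Bar, [
    {'id': 1, 'x': 42},
    {'id': 2, 'x': 7},
])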

Lyman Zerga

2 Answers


The approach is correct in terms of usage. The only thing I would change is something like below:

mappings = []
i = 0

for b, foo_x in session.query(Bar, Foo.x).join(Foo, Foo.id == Bar.foo_id):
    info = {'id': b.id, 'x': foo_x}
    mappings.append(info)
    i = i + 1
    if i % 10000 == 0:
        # Push the current batch of 10k rows to the DB and start a fresh one
        session.bulk_update_mappings(Bar, mappings)
        session.flush()
        session.commit()
        mappings[:] = []

# Don't forget the last, partially filled batch
session.bulk_update_mappings(Bar, mappings)
session.commit()

This makes sure you don't have too much data hanging in memory and that you don't send one overly big update to the DB at a single time.

Tarun Lalwani

  • Great! I have a million rows to update and a poor connection between SQL and the backend, so the update process seemed to hang forever. Thanks for the suggestion of splitting the data. – Junyong Yao Mar 26 '18 at 06:01
  • Is there any way to specify the WHERE clause using fields other than the primary key? – Mehrdad Aug 17 '19 at 06:52
  • @Mehrdad, I would suggest opening a new question on that and posting a link here if needed, as the problems are totally different – Tarun Lalwani Aug 17 '19 at 07:02
  • @TarunLalwani Good suggestion, I did that: https://stackoverflow.com/questions/57534134/sqlalchemy-bulk-update-with-custom-where-clause – Mehrdad Aug 17 '19 at 07:15
  • By default, session has autoflush enabled, so there's no need to flush explicitly every 10k records – Taras Jan 18 '22 at 05:00

Not directly related to this question, but for those looking for more performance when updating or inserting with either bulk_update_mappings or bulk_insert_mappings: just add fast_executemany to your engine as follows:

engine = create_engine(connection_string, fast_executemany=True)

You can use that parameter in SQLAlchemy versions 1.3 and above. It comes from pyodbc and will definitely speed up your bulk requests.
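
For completeness, here is a minimal sketch of a full engine setup, assuming a SQL Server database reached through pyodbc; the user, server, database, and driver names are placeholders you would replace with your own:

from sqlalchemy import create_engine

# Hypothetical mssql+pyodbc connection string; adjust user, password,
# server, database, and ODBC driver to match your environment.
connection_string = (
    "mssql+pyodbc://user:password@my_server/my_db"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# fast_executemany makes pyodbc send all parameter sets of an
# executemany() call in one batch instead of one round trip per row.
engine = create_engine(connection_string, fast_executemany=True)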