
I have a for loop that goes through millions of objects. What would be the suggested way to commit this? Here are a few examples I was thinking of:

# after each
for item in items:
    cursor.execute(query, (item,))
    conn.commit()

# at the end
for item in items:
    cursor.execute(query, (item,))
conn.commit()

# after N items
for n, item in enumerate(items):
    cursor.execute(query, (item,))
    if (n + 1) % N == 0:
        conn.commit()
conn.commit()

Which of the above would be the most efficient?

David542
  • IMO if you're not using transactions then it's best to directly turn on [autocommit](http://stackoverflow.com/a/384452/2629998) on the DB object. Otherwise commit at the end if you're satisfied with the results, or rollback if something bad happened in the meantime. –  Sep 07 '14 at 18:38
  • Why not try it and find out, on a copy of your actual data set? I'd say one of the latter two. Of course if the program bombs out, you don't want to have to run too many updates again - so the last one seems sensible to me. – halfer Sep 07 '14 at 18:39
  • I'd be inclined to suggest at the end. – hd1 Sep 07 '14 at 18:40
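The autocommit suggestion from the first comment can be sketched like this. This uses Python's built-in sqlite3 module purely for illustration (passing `isolation_level=None` puts it in autocommit mode); with a MySQL driver you would enable the equivalent option on your own connection object instead.

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode:
# each statement is committed as soon as it executes.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE t (val INTEGER)")
conn.execute("INSERT INTO t (val) VALUES (1)")

# No conn.commit() needed; the row is already durable.
count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count)  # 1
```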

2 Answers


You're asking, should I commit ...

  1. just once at the end of the huge operation
  2. after each row update
  3. after every N items.

First of all, if you're working with MyISAM, don't worry about it. MyISAM doesn't implement transactions.

If your system won't be corrupted by committing only some of your row changes, you should commit after every N items. That's because committing after every single item will slow your processing down a lot, while committing only once at the end of the huge operation is likely either to blow out your rollback space or to take an unspeakably long time, during which your other database users will see slowdowns.

I've been through this countless times. Any N greater than about 20 will do fine.
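A minimal sketch of the every-N pattern, using Python's built-in sqlite3 so it runs anywhere; the table name, column, and batch size are placeholders, so substitute your own MySQL connection and statement:

```python
import sqlite3

def insert_in_batches(conn, items, batch_size=1000):
    """Insert items one at a time, committing once per batch_size rows."""
    cursor = conn.cursor()
    for n, item in enumerate(items, start=1):
        cursor.execute("INSERT INTO t (val) VALUES (?)", (item,))
        if n % batch_size == 0:
            conn.commit()
    conn.commit()  # flush the final partial batch

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (val INTEGER)")
insert_in_batches(conn, range(10_000), batch_size=500)
total = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(total)  # 10000
```

Note the second `conn.commit()` after the loop: without it, any rows inserted since the last full batch would be left uncommitted.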

O. Jones

Arguably the "at the end" one, since you BEGIN and COMMIT only once, and the whole loop runs as a single transaction with its own scope. It's also easier from a concurrency point of view: the transaction essentially says, "this is my table now, don't touch it anyone."

If you commit multiple times (the other two solutions), you BEGIN and COMMIT a transaction many times (a new transaction starts right after each commit). That means more chances of interference from other ongoing DB operations, and the repeated BEGIN/COMMIT operations take time themselves.

You should run a benchmark simulating your use case, though. I'd be curious to know whether, depending on conditions (number of rows to insert, session configuration, data types, indexes used), one or the other solution prevails.
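A rough benchmark harness for the three strategies might look like the following. It uses an in-memory sqlite3 database just so it is self-contained, so the absolute timings will not reflect MySQL (where commits hit disk and take locks), but the shape of the comparison carries over to a real connection:

```python
import sqlite3
import time

def bench(strategy, n_rows=20_000, batch=1000):
    """Time n_rows inserts under one of three commit strategies:
    'each' (commit per row), 'batch' (commit per batch rows), 'end'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (val INTEGER)")
    cur = conn.cursor()
    start = time.perf_counter()
    for i in range(n_rows):
        cur.execute("INSERT INTO t (val) VALUES (?)", (i,))
        if strategy == "each":
            conn.commit()
        elif strategy == "batch" and (i + 1) % batch == 0:
            conn.commit()
    conn.commit()  # 'end' commits only here; 'batch' flushes its tail
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed

for s in ("each", "batch", "end"):
    print(f"{s:>5}: {bench(s):.3f}s")
```

Against a real MySQL server you would swap in your driver's connection and, ideally, run each strategy on a copy of the actual data set, as the comments above suggest.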

PawelP
  • Just a guess - committing only once might expand the pending writes buffer past its memory limits though, which is why I would favour regular commits. I don't know enough about MySQL however to take a view on that. You're right about benchmarking! – halfer Sep 07 '14 at 18:55
  • Yeah, that's the obvious problem with a massive commit. It really depends what data you're inserting - millions of rows for a simple association table (INT, INT) is n_millions*8 MB - that's not a lot. Also, a note: I only have experience with InnoDB. – PawelP Sep 07 '14 at 18:58