
I'm working with the MySQLdb module in Python to interact with a database. I have a situation where there is a very large list (tens of thousands of elements) which I need to insert as rows into a table.

My solution right now is to generate a large INSERT statement as a string and execute it.

Is there a smarter way?

Mike

3 Answers


There is a smarter way.

The problem with bulk insertions is that, by default, autocommit is enabled, which forces each INSERT statement to be written to stable storage before the next one can start.

As the manual page notes:

By default, MySQL runs with autocommit mode enabled. This means that as soon as you execute a statement that updates (modifies) a table, MySQL stores the update on disk to make it permanent. To disable autocommit mode, use the following statement:

SET autocommit=0; 

After disabling autocommit mode by setting the autocommit variable to zero, changes to transaction-safe tables (such as those for InnoDB, BDB, or NDBCLUSTER) are not made permanent immediately. You must use COMMIT to store your changes to disk or ROLLBACK to ignore the changes.

This is a pretty common feature of RDBMSs, which presume that database integrity is paramount. It makes bulk inserts take on the order of 1 s per insert instead of 1 ms. The alternative of building one oversized INSERT statement tries to achieve that single commit, at the risk of overloading the SQL parser.
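In MySQLdb terms, that boils down to running the whole batch inside one transaction and committing once at the end. A minimal sketch (table and column names are invented for illustration):

import MySQLdb

con = MySQLdb.connect(host="localhost", user="user", passwd="**", db="test")
cur = con.cursor()

# Make sure autocommit is off so each execute() is not flushed to disk on its
# own; MySQLdb 1.2.0+ already does this by default, per PEP 249.
con.autocommit(False)

for a, b in your_data_list:
    cur.execute("INSERT INTO mytable (col_a, col_b) VALUES (%s, %s)", (a, b))

con.commit()  # one commit for the whole batch
con.close()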

msw
  • Starting with 1.2.0, MySQLdb disables autocommit by default, as required by the DB-API standard (PEP-249). source: http://mysql-python.sourceforge.net/FAQ.html – mikewaters Jan 16 '12 at 23:59
  • If you feel that your inserts are still slower than they should be, make sure to also tweak your mysql server settings. In my case my `innodb_buffer_pool_size` was way too small for my transaction sizes and by raising it I achieved a +40% speedup for bulk inserts. See: https://dev.mysql.com/doc/refman/5.7/en/innodb-buffer-pool.html – jlh Jan 10 '18 at 21:10

If you have to insert a very large amount of data, why are you trying to insert it all in one single INSERT? (Building that huge statement string puts unnecessary load on your memory, both while constructing it and while executing it, and it stops being workable at all if the data gets very, very large.)

Why don't you insert one row per INSERT command, run them all in a for loop, and commit all the changes at the end?

import MySQLdb

con = MySQLdb.connect(
    host="localhost",
    user="user",
    passwd="**",
    db="db name"
)
cur = con.cursor()

for data in your_data_list:
    # let the driver quote the value instead of building the string yourself
    cur.execute("INSERT INTO your_table (your_column) VALUES (%s)", (data,))

con.commit()
con.close()

(Believe me, this is really fast. If you are still getting slow results, autocommit is probably on; set it to False, as msw says.)
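A related option, sketched here with made-up table and column names rather than anything from the original answer: MySQLdb's cursor.executemany() takes the INSERT once plus the whole list of parameter tuples, so you don't have to build the statements yourself (and, as far as I know, MySQLdb rewrites plain INSERT ... VALUES statements into multi-row INSERTs under the hood, which cuts down the round trips):

rows = [(1, "a"), (2, "b")]  # your_data_list reshaped into parameter tuples
cur.executemany("INSERT INTO mytable (id, name) VALUES (%s, %s)", rows)
con.commit()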

Pushpak Dagade
  • If there is a statement for each insertion, wouldn't it be very slow? I don't mind using memory. It will only be megabytes so I'm not concerned. – Mike Jun 26 '11 at 08:50
  • No, it won't be slow, that is what I want to say... as long as you are not committing in between, inside the loop. Try both and see, if you don't believe me... – Pushpak Dagade Jun 26 '11 at 09:12
  • In MyISAM (the engine I'm using) isn't committing done implicitly after executions? – Mike Jun 26 '11 at 09:46
  • I don't know about that as I don't use MyISAM, but if it is so, you can turn off committing by executing `SET autocommit = 0;` from the mysql shell (or, to do it from within your Python program, use `cur.execute('SET autocommit = 0'); con.commit()`) – Pushpak Dagade Jun 26 '11 at 10:31
  • I realize this is very old and maybe at the time it was written this wasn't the case, but the docs suggest using multi-row inserts to cut down on communication overhead. http://dev.mysql.com/doc/refman/5.5/en/optimizing-innodb-bulk-data-loading.html – Dean MacGregor Sep 18 '14 at 03:02
  • Just a quick side note: adding data to the insert statement like so: `for data in your_data_list: cur.execute("data you want to insert: %s" % data)` opens the floodgates to SQL injection attacks. It's best to pass the values as parameters instead: `for data in your_data_list: cur.execute("... VALUES (?, ?)", item1, item2)` (each parameter after the SQL statement string fills in a `?` placeholder). See https://mkleehammer.github.io/pyodbc/ under 'Insert Data'. Happy coding! – Colby Oct 10 '16 at 15:57

As long as you're doing it as a single INSERT and not thousands of individual ones, then yes, this is the best way to do it. Watch out that you don't exceed MySQL's max packet size, and adjust it if necessary. For example, this starts the server with a 32 MB packet maximum; you need to do the same on the client too.

mysqld --max_allowed_packet=32M
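If restarting mysqld isn't convenient, the same server-side limit can usually be raised at runtime from a sufficiently privileged session; a quick sketch through a MySQLdb cursor (value is in bytes, and it only applies to connections opened after the change):

cur.execute("SET GLOBAL max_allowed_packet = %d" % (32 * 1024 * 1024))  # 32 MB

The client keeps its own limit, e.g. the mysql command-line client accepts the same --max_allowed_packet=32M option.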
justinhj