
I'm trying to do a large-scale bulk insert into a SQLite database with peewee. I'm using `atomic()`, but the performance is still terrible. I'm inserting the rows in blocks of ~2500 rows, and due to `SQLITE_MAX_VARIABLE_NUMBER` I'm inserting about 200 of them at a time. Here is the code:

with helper.db.atomic():
    for i in range(0, len(expression_samples), step):
        gtd.GeneExpressionRead.insert_many(expression_samples[i:i+step]).execute()

The list `expression_samples` is a list of dictionaries with the appropriate fields for the `GeneExpressionRead` model. I've timed this loop, and it takes anywhere from 2–8 seconds per iteration. I have millions of rows to insert, and the way my code is written now it will likely take about 2 days to complete. As per this post, there are several pragmas that I have set in order to improve performance, but that didn't really change anything for me performance-wise. Lastly, as per this test on the peewee GitHub page it should be possible to insert many rows very fast (~50,000 in 0.3364 seconds), but it also seems that the author used raw SQL to get that performance. Has anyone been able to do such a high-performance insert using peewee methods?
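For context, the chunking arithmetic I'm describing can be sketched standalone with the stdlib `sqlite3` module (table and column names here are placeholders, and I'm assuming the old default variable limit of 999; the real model has more fields):

```python
import sqlite3

# Placeholder stand-in for the GeneExpressionRead table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene_expression_read (gene INTEGER, tissue INTEGER, value INTEGER)")

MAX_VARS = 999            # assumed SQLITE_MAX_VARIABLE_NUMBER (default in older builds)
FIELDS = 3                # bound variables consumed per row
step = MAX_VARS // FIELDS # rows per multi-row INSERT, to stay under the limit

rows = [(i % 50, i % 10, i) for i in range(2500)]

with conn:  # one transaction wrapping all the chunks
    for i in range(0, len(rows), step):
        chunk = rows[i:i + step]
        placeholders = ",".join(["(?,?,?)"] * len(chunk))
        flat = [v for row in chunk for v in row]
        conn.execute(
            "INSERT INTO gene_expression_read (gene, tissue, value) VALUES " + placeholders,
            flat,
        )

count = conn.execute("SELECT COUNT(*) FROM gene_expression_read").fetchone()[0]
```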

Edit: Did not realize that the test on peewee's github page was for MySQL inserts. May or may not apply to this situation.

themantalope

2 Answers


Mobius was trying to be helpful in the comments but there's a lot of misinformation in there.

  • Peewee creates indexes for foreign keys when you create the table. This happens for all database engines currently supported.
  • Turning on the foreign-key PRAGMA is going to slow things down: every insert then has to verify that the referenced row exists.
  • For best performance, do not create any indexes on the table you are bulk-loading into. Load the data, then create the indexes. This is much much less work for the database.
  • As you noted, disabling auto-increment for the bulk load speeds things up.
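To illustrate the "load first, index later" point, here's a minimal sketch with the stdlib `sqlite3` module (table and index names are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reads (gene INTEGER, value INTEGER)")

# Bulk-load first, while the table has no secondary indexes to maintain...
with conn:
    conn.executemany(
        "INSERT INTO reads VALUES (?, ?)",
        [(i % 100, i) for i in range(10000)],
    )

# ...then build the index once, in a single pass over the loaded data.
conn.execute("CREATE INDEX idx_reads_gene ON reads (gene)")

count = conn.execute("SELECT COUNT(*) FROM reads").fetchone()[0]
indexes = [row[1] for row in conn.execute("PRAGMA index_list('reads')")]
```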

Other information:

  • Use PRAGMA journal_mode=wal;
  • Use PRAGMA synchronous=0;
  • Use PRAGMA locking_mode=EXCLUSIVE;

Those are some good settings for loading in a bunch of data. Check the sqlite docs for more info:

http://sqlite.org/pragma.html
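Applying those pragmas looks like this with the stdlib `sqlite3` module (note that `journal_mode=wal` only takes effect on a file-backed database; an in-memory database reports `memory`):

```python
import os
import sqlite3
import tempfile

# WAL requires a real file, so use a throwaway database file.
path = os.path.join(tempfile.mkdtemp(), "bulk.db")
conn = sqlite3.connect(path)

conn.execute("PRAGMA journal_mode=wal")
conn.execute("PRAGMA synchronous=0")
conn.execute("PRAGMA locking_mode=EXCLUSIVE")

# Read the settings back to confirm they took effect.
mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
sync = conn.execute("PRAGMA synchronous").fetchone()[0]
```

If I recall correctly, newer peewee versions also accept a `pragmas` keyword on `SqliteDatabase` so you can set these at connection time; check the peewee docs for your version.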

coleifer

In all of the documentation where `atomic` appears as a context manager, it's called as a function. Since it sounds like your code never exits the `with` block, you're probably just not seeing the error about the object not having an `__exit__` method.

Can you try `with helper.db.atomic():`?

`atomic()` starts a transaction. Without an open transaction, inserts are much slower, because some expensive bookkeeping has to be done for every write instead of only once at the beginning and end.
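The difference can be sketched with the stdlib `sqlite3` module; the table below is a made-up example, and the second loop is the moral equivalent of what `atomic()` gives you in peewee:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")

# Autocommit mode: every INSERT is its own transaction,
# so the commit bookkeeping happens once per row.
conn.isolation_level = None
for v in range(1000):
    conn.execute("INSERT INTO t VALUES (?)", (v,))

# One explicit transaction around the whole batch:
# the bookkeeping happens once at BEGIN and once at COMMIT.
conn.execute("BEGIN")
for v in range(1000):
    conn.execute("INSERT INTO t VALUES (?)", (v,))
conn.execute("COMMIT")

total = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

On a file-backed database the gap is dramatic, because each autocommit transaction forces its own sync to disk; an in-memory database won't show the same magnitude.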

EDIT

Since the code at the start of the question was changed, can I have some more information about the table you're inserting into? How large is it, and how many indices are there?

Since this is SQLite, you're just writing to a file, but do you know if that file is on a local disk or on a network-mounted drive? I've had issues just like this because I was trying to insert into a database on an NFS mount.

mobiusklein
  • not sure what you mean by `with helper.db.atomic():`. That's what I have in my code and was stated in the question. – themantalope May 28 '16 at 17:28
  • Oh I see it now. The missing `()` was a typo. Fixing now. – themantalope May 28 '16 at 17:30
  • Please see my edited content. Now that the potentially trivial is ruled out, I need to start learning more about your database schema. Are you creating a database for RNA-seq data? GeneExpressionRead sounds like a FastQ or SAM entry. – mobiusklein May 28 '16 at 17:43
  • Yeah, it's RNA-seq. The data is already processed and I'm just inserting ints of gene reads. The other data inserted for each row are foreign keys for the gene marker, tissue and sample. The database is a file on the hard drive, not a network drive. – themantalope May 28 '16 at 17:48
  • Is the `foreign_keys` pragma turned on? Is there an index on the foreign key column and the column on the other table it references? – mobiusklein May 28 '16 at 17:51
  • I have not changed the foreign keys pragma from default. In addition, I realized that I was assigning the objects representing the foreign keys for the new row being inserted. Would it be faster to insert the integer corresponding to the foreign object's id? – themantalope May 28 '16 at 19:08
  • Anything you can do to make the ORM layer do less work would be good for performance. Please turn on that pragma, and please make sure both of those columns are indexed. Most RDBMS automatically index those columns, but SQLite does not. – mobiusklein May 28 '16 at 19:10
  • OK, I'll give that a try and get back to you. Thanks for the suggestions! – themantalope May 28 '16 at 19:20
  • even with the pragmas and using the int value for the foreign key fields it's still taking about 5 seconds per block to execute. Any other ideas? I think for each table only the model ids are indexed. – themantalope May 28 '16 at 20:06
  • If they're really just integers you're inserting, with just one foreign key pointing to another table, I'd be skeptical it's not something else inside the book-keeping machinery in peewee, but I'm not an expert on its internals. I'd try the same with the FK pragma turned off and see if there is a difference. – mobiusklein May 28 '16 at 20:15
  • Still no luck, even after upgrading my sqlite version. I'm guessing that this is something within peewee. I'll try using peewee with postgres and see if that gives me any better performance. – themantalope May 29 '16 at 19:00
  • Hmmmm, even with Postgres the performance is still very slow. I've been seeing a lot of articles about not using an index when doing bulk inserts like this. Any idea how to turn off the index while doing large inserts and then build it afterwards using peewee? – themantalope May 29 '16 at 21:45
  • `peewee` doesn't appear to provide an API for dropping indices, only creating them. You could issue raw SQL `drop index` statements, but at this point, I'd be skeptical of either your table design or your ORM. I have a similar setup in SQLAlchemy and I have no issues with SQLite. – mobiusklein May 30 '16 at 00:17
  • I figured out what it was. When I was doing the bulk insert initially, I was allowing peewee/sqlite to take care of the indexing and maintaining the primary key for the table. Switching off the `auto_increment` option in the peewee model and manually assigning the `id` value during insertion **significantly** sped up the insertion process. I was able to insert about 150,000,000 rows into a table in about 5 hours vs 48 hours when I let sqlite do the book keeping. – themantalope May 31 '16 at 18:52
  • Providing a PK does speed up inserts, but it shouldn't be **that** big. I'd suggest opening a ticket about this experience. – mobiusklein May 31 '16 at 21:15
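The fix themantalope describes (assigning the primary key yourself instead of letting SQLite pick the next rowid) can be sketched with the stdlib `sqlite3` module; in peewee terms this would correspond to declaring the primary key field on the model yourself (e.g. `id = IntegerField(primary_key=True)`) and including `id` in every insert dict. Table and column names below are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reads (id INTEGER PRIMARY KEY, value INTEGER)")

values = [17, 42, 99]

# Assign the primary key explicitly rather than letting
# SQLite compute the next rowid for each inserted row.
with conn:
    conn.executemany(
        "INSERT INTO reads (id, value) VALUES (?, ?)",
        [(i + 1, v) for i, v in enumerate(values)],
    )

ids = [row[0] for row in conn.execute("SELECT id FROM reads ORDER BY id")]
```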