I ran some tests on Django 1.10 / PostgreSQL 9.4 / Pandas 0.19.0 and got the following timings:
- Insert 3000 rows individually and get IDs back from the populated objects using the Django ORM: 3200ms
- Insert 3000 rows with Pandas `DataFrame.to_sql()` and don't get IDs: 774ms
- Insert 3000 rows with the Django manager `.bulk_create(Model(**df.to_records()))` (sketched below) and don't get IDs: 574ms
- Insert 3000 rows with `to_csv` into a `StringIO` buffer and `COPY` (`cur.copy_from()`) and don't get IDs: 118ms
- Insert 3000 rows with `to_csv` and `COPY` and get IDs via a simple SELECT WHERE ID > [max ID before insert] (probably not threadsafe unless `COPY` holds a lock on the table preventing simultaneous inserts?): 201ms
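The `bulk_create` line above is shorthand; written out it is roughly the following, where `MyModel` is a stand-in for whichever model the DataFrame columns map onto (a sketch of the approach, not the exact code I timed):

# Rough expansion of the bulk_create one-liner above. MyModel is a hypothetical
# model whose field names match the DataFrame columns.
rows = df.to_dict(orient='records')  # list of {field: value} dicts
MyModel.objects.bulk_create([MyModel(**row) for row in rows])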
from io import StringIO


def bulk_to_sql(df, columns, model_cls):
    """ Inserting 3000 takes 774ms avg """
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    df[columns].to_sql(model_cls._meta.db_table, con=engine, if_exists='append', index=False)
def bulk_via_csv(df, columns, model_cls):
    """ Inserting 3000 takes 118ms avg """
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    connection = engine.raw_connection()
    cursor = connection.cursor()
    # Stream the frame through an in-memory tab-separated buffer...
    output = StringIO()
    df[columns].to_csv(output, sep='\t', header=False, index=False)
    output.seek(0)
    # ...and COPY it straight into the table.
    cursor.copy_from(output, model_cls._meta.db_table, null="", columns=columns)
    connection.commit()
    cursor.close()
The performance stats were all obtained on a table already containing 3,000 rows, running on OS X (i7, SSD, 16GB), as an average of ten runs using `timeit`.
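For reference, each figure came from something along these lines, using the helpers above (`MyModel` again being a stand-in for the real model):

import timeit

# Illustrative only: average of ten runs, as used for the figures above.
# bulk_via_csv, df and columns are from the snippets above; MyModel is hypothetical.
avg_seconds = timeit.timeit(lambda: bulk_via_csv(df, columns, MyModel), number=10) / 10
print("{:.0f}ms avg".format(avg_seconds * 1000))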
I get my inserted primary keys back by assigning an import batch ID and sorting by primary key, although I'm not 100% certain primary keys will always be assigned in the order the rows are serialized for the `COPY` command - would appreciate opinions either way.
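The cheaper [max ID before insert] variant from the timings looks roughly like this (again a sketch with a stand-in `MyModel`, and not guaranteed safe under concurrent inserts, per the caveat above):

from django.db.models import Max

# Sketch of the 201ms variant: remember the highest PK, COPY the rows in,
# then select everything above that mark. MyModel is a hypothetical stand-in.
max_id_before = MyModel.objects.aggregate(m=Max('pk'))['m'] or 0
bulk_via_csv(df, columns, MyModel)
new_ids = list(MyModel.objects
               .filter(pk__gt=max_id_before)
               .order_by('pk')
               .values_list('pk', flat=True))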
Update 2020:
I tested the new `to_sql(method="multi")` functionality in Pandas >= 0.24, which puts all inserts into a single, multi-row INSERT statement. Surprisingly, performance was worse than the single-row version, whether on Pandas 0.23, 0.24 or 1.1. Pandas single-row inserts were also faster than a multi-row INSERT statement issued directly to the database. I am using more complex data in a bigger database this time, but `to_csv` with `cursor.copy_from` was still around 38% faster than the fastest alternative, which was single-row `df.to_sql`, and `bulk_import` was occasionally comparable, but often slower still (up to double the time, Django 2.2).