I ran some tests on Django 1.10 / PostgreSQL 9.4 / Pandas 0.19.0 and got the following timings:
- Insert 3000 rows individually and get IDs back from the populated objects using the Django ORM: 3200ms
- Insert 3000 rows with Pandas `DataFrame.to_sql()` and don't get IDs: 774ms
- Insert 3000 rows with the Django manager `.bulk_create(Model(**df.to_records()))` (sketched below) and don't get IDs: 574ms
- Insert 3000 rows with `to_csv` into a `StringIO` buffer and `COPY` (`cur.copy_from()`) and don't get IDs: 118ms
- Insert 3000 rows with `to_csv` and `COPY` and get IDs via a simple SELECT WHERE ID > [max ID before insert] (probably not threadsafe unless `COPY` holds a lock on the table preventing simultaneous inserts?): 201ms
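The `bulk_create` line above is shorthand; written out it is roughly the following, where `MyModel` is a stand-in for whichever model the DataFrame columns map onto (a sketch of the approach, not the exact code I timed):

# Rough expansion of the bulk_create one-liner above. MyModel is a hypothetical
# model whose field names match the DataFrame columns.
rows = df.to_dict(orient='records')  # list of {field: value} dicts
MyModel.objects.bulk_create([MyModel(**row) for row in rows])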
from io import StringIO


def bulk_to_sql(df, columns, model_cls):
    """ Inserting 3000 takes 774ms avg """
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    df[columns].to_sql(model_cls._meta.db_table, con=engine, if_exists='append', index=False)
def bulk_via_csv(df, columns, model_cls):
    """ Inserting 3000 takes 118ms avg """
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    connection = engine.raw_connection()
    cursor = connection.cursor()
    # Stream the frame through an in-memory tab-separated buffer...
    output = StringIO()
    df[columns].to_csv(output, sep='\t', header=False, index=False)
    output.seek(0)
    # ...and COPY it straight into the table.
    cursor.copy_from(output, model_cls._meta.db_table, null="", columns=columns)
    connection.commit()
    cursor.close()
The performance stats were all obtained on a table already containing 3,000 rows, running on OS X (i7, SSD, 16GB), as an average of ten runs using `timeit`.
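For reference, each figure came from something along these lines, using the helpers above (`MyModel` again being a stand-in for the real model):

import timeit

# Illustrative only: average of ten runs, as used for the figures above.
# bulk_via_csv, df and columns are from the snippets above; MyModel is hypothetical.
avg_seconds = timeit.timeit(lambda: bulk_via_csv(df, columns, MyModel), number=10) / 10
print("{:.0f}ms avg".format(avg_seconds * 1000))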
I get my inserted primary keys back by assigning an import batch ID and sorting by primary key, although I'm not 100% certain primary keys will always be assigned in the order the rows are serialized for the `COPY` command - would appreciate opinions either way.
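The cheaper [max ID before insert] variant from the timings looks roughly like this (again a sketch with a stand-in `MyModel`, and not guaranteed safe under concurrent inserts, per the caveat above):

from django.db.models import Max

# Sketch of the 201ms variant: remember the highest PK, COPY the rows in,
# then select everything above that mark. MyModel is a hypothetical stand-in.
max_id_before = MyModel.objects.aggregate(m=Max('pk'))['m'] or 0
bulk_via_csv(df, columns, MyModel)
new_ids = list(MyModel.objects
               .filter(pk__gt=max_id_before)
               .order_by('pk')
               .values_list('pk', flat=True))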
Update 2020:
I tested the new `to_sql(method="multi")` functionality in Pandas >= 0.24, which puts all inserts into a single, multi-row INSERT statement. Surprisingly, performance was worse than the single-row version, whether on Pandas 0.23, 0.24 or 1.1. Pandas single-row inserts were also faster than a multi-row INSERT statement issued directly to the database. I am using more complex data in a bigger database this time, but `to_csv` with `cursor.copy_from` was still around 38% faster than the fastest alternative, which was single-row `df.to_sql`, and `bulk_import` was occasionally comparable, but often slower still (up to double the time, Django 2.2).