
Working with a large pandas DataFrame that needs to be dumped into a PostgreSQL table. From what I've read, it's not a good idea to dump it all at once (doing so was locking up the db); it's better to use the chunksize parameter. The answers here are helpful for workflow, but I'm asking specifically about how the value of chunksize affects performance.

In [5]: df.shape
Out[5]: (24594591, 4)

In [6]: df.to_sql('existing_table',
                  con=engine, 
                  index=False, 
                  if_exists='append', 
                  chunksize=10000)

Is there a recommended default and is there a difference in performance when setting the parameter higher or lower? Assuming I have the memory to support a larger chunksize, will it execute faster?
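To put numbers behind this yourself, a micro-benchmark along these lines can compare chunksize values. This is a hypothetical sketch: it uses an in-memory SQLite database as a stand-in for PostgreSQL (so the absolute timings will differ from a real Postgres setup), and the table name and DataFrame shape are placeholders.

```python
# Hypothetical micro-benchmark: time to_sql at several chunksize values.
# SQLite stands in for PostgreSQL, so absolute numbers will differ.
import time
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame(np.random.rand(50_000, 4), columns=list("abcd"))

for chunksize in (1_000, 10_000, 50_000):
    engine = create_engine("sqlite://")  # fresh in-memory DB per run
    start = time.perf_counter()
    df.to_sql("existing_table", con=engine,
              index=False, if_exists="append", chunksize=chunksize)
    print(f"chunksize={chunksize}: {time.perf_counter() - start:.2f}s")
```

Run against your actual Postgres engine (and real row widths) to see whether a larger chunksize actually helps in your environment.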

    FYI, the question you link to is for `read_csv`, so not entirely relevant. The `chunksize` of `to_sql` is useful when you get time out errors (see http://pandas.pydata.org/pandas-docs/stable/io.html#writing-dataframes or http://stackoverflow.com/questions/24007762/python-pandas-using-to-sql-to-write-large-data-frames-in-chunks). If you don't have that problem, you do not need to use `chunksize` – joris Feb 04 '16 at 14:28
    [relevant link](http://acepor.github.io/2017/08/03/using-chunksize/) with a benchmark. – Kevin Sep 11 '18 at 12:22

2 Answers


In my case, 3M rows with 5 columns were inserted in 8 minutes using pandas' to_sql with chunksize=5000 and method='multi'. This was a huge improvement, as inserting 3M rows from Python into the database had been very slow before.
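The setup described above can be sketched as follows. This is an illustrative example, not the answerer's actual code: SQLite stands in for the real database, and the table name, column names, and DataFrame contents are placeholders. `method='multi'` tells pandas to batch many rows into a single multi-value INSERT statement instead of one INSERT per row.

```python
# Sketch of a chunked insert using chunksize plus method='multi'.
# SQLite stands in for the real database; names are placeholders.
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")
df = pd.DataFrame(np.random.rand(10_000, 5),
                  columns=["c1", "c2", "c3", "c4", "c5"])

df.to_sql("target_table", con=engine,
          index=False, if_exists="append",
          chunksize=100,     # 100 rows x 5 cols = 500 bind parameters
          method="multi")    # one multi-row INSERT per chunk
```

Note that with method='multi' each chunk becomes one statement with chunksize × ncols bind parameters, so the backend's parameter limit caps how large chunksize can usefully be.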


I tried something the other way around: from SQL to CSV. I noticed that the smaller the chunksize, the quicker the job was done. Adding additional CPUs to the job (multiprocessing) didn't change anything.
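The reverse direction described above can be sketched with read_sql's chunksize, streaming the table and appending each chunk to the CSV. Again a hedged example: SQLite, the table name, and the output filename are all placeholders for the answerer's actual setup.

```python
# Sketch of a chunked SQL-to-CSV export via read_sql's chunksize.
# SQLite and the table/file names are placeholders.
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")
pd.DataFrame(np.random.rand(5_000, 4), columns=list("abcd")).to_sql(
    "source_table", con=engine, index=False)

first = True
for chunk in pd.read_sql("SELECT * FROM source_table", engine,
                         chunksize=1_000):
    # Write the header only for the first chunk, then append.
    chunk.to_csv("out.csv", mode="w" if first else "a",
                 header=first, index=False)
    first = False
```

Because each chunk is written as soon as it is read, peak memory stays bounded by the chunk size rather than the full table.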

    It may not always be the case that the smaller the chunksize, the quicker the process is. According to the observations in this article (https://acepor.github.io/2017/08/03/using-chunksize), setting the chunksize to 10000 seems to be optimal. – Ethan Chen Aug 07 '19 at 23:35