Dataframe writing to Postgresql poor performance

Question

working in postgresql I have a cartesian join producing ~4 million rows. The join takes ~5sec and the write back to the DB takes ~1min 45sec.

The data will be required for use in python, specifically in a pandas dataframe, so I am experimenting with duplicating this same data in python. I should say here that all these tests are running on one machine, so nothing is going across a network.

Using psycopg2 and pandas, reading in the data and performing the join to get the 4 million rows (from an answer here:cartesian product in pandas) takes consistently under 3 secs, impressive.

Writing the data back to a table in the database however takes anything from 8 minutes (best method) to 36+minutes (plus some methods I rejected as I had to stop them after >1hr).

While I was not expecting to reproduce the "sql only" time, I would hope to be able to get closer than 8 minutes (I`d have thought 3-5 mins would not be unreasonable).

Slower methods include:

36min - sqlalchemy`s table.insert (from 'test_sqlalchemy_core' here https://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow)

13min - psycopg2.extras.execute_batch (https://stackoverflow.com/a/52124686/3979391)

13-15min (depends on chunksize) - pandas.dataframe.to_sql (again using sqlalchemy) (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html)

Best way (~8min) is using psycopg2`s cursor.copy_from method (found here: https://github.com/blaze/odo/issues/614#issuecomment-428332541). This involves dumping the data to a csv first (in memory via io.StringIO), that alone takes 2 mins.

So, my questions:

Anyone have any potentially faster ways of writing millions of rows from a pandas dataframe to postgresql?
The docs for the cursor.copy_from method (http://initd.org/psycopg/docs/cursor.html) state that the source object needs to support the read() and readline() methods (hence the need for io.StringIO). Presumably, if the dataframe supported those methods, we could dispense with the write to csv. Is there some way to add these methods?

Thanks. Giles

EDIT:

On Q2 - pandas can now use a custom callable for to_sql and the given example here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method does pretty much what I suggest above (IE it copies csv data directly from STDIN using StringIO). I found an ~40% increase in write speed using this method, which brings to_sql close to the "best" method mentioned above.

Does this answer your question? [Write fast pandas dataframe to postgres](https://stackoverflow.com/questions/41875817/write-fast-pandas-dataframe-to-postgres) — Scott Smith, Apr 11 '20 at 03:28
I flagged this as a duplicate, though your question has a lot of valuable info on the speed of different approaches. If possible add this to the Q&A on that question! — Scott Smith, Apr 11 '20 at 03:29
Thanks, looks like the answer there is using my "Best Method" from above. I have added a comment to the accepted answer with a link here in case it helps anyone. — Giles, Apr 12 '20 at 11:49
Also, not convinced this is a duplicate - the other is "can you speed up this code" whereas this is "what is the fastest method". — Giles, Apr 12 '20 at 12:08

score 1 · Answer 1 · answered Apr 05 '19 at 11:51

Answering Q 1 myself: It seems the issue had more to do with Postgresql (or rather Databases in general). Taking into account points made in this article:https://use-the-index-luke.com/sql/dml/insert I found the following:

1) Removing all indexes from the destination table resulted in the query running in 9 seconds. Rebuilding the indexes (in postgresql) took a further 12 seconds, so still well under the other times.

2) With only a primary key in place, Inserting rows ordered by the primary key columns reduced the time taken to about a third. This makes sense as there should be little or no shuffling of the index rows required. I also verified that this is the reason why my cartesian join in postgresql was faster in the first place (IE the rows were ordered by the index, purely by chance), placing the same rows in a temporary table (unordered) and inserting from that actually took allot longer.

3) I tried similar experiments on our mysql systems and found the same increase in insert speed when removing indexes. With mysql however it seemed that rebuilding the indexes used up any time gained.

I hope this helps anyone else who comes across this question from a search.

I still wonder if it is possible to remove the write to csv step in python (Q2 above) as I believe I could then write something in python that would be faster than pure postgresql.

Thanks, Giles

Dataframe writing to Postgresql poor performance

EDIT:

1 Answers1

Linked