
I have a Redshift server, which I connect to via psycopg2 (note that ODBC is not supported on the company server, so I can't use pyodbc).

Currently it takes over 10 minutes to write 30-35k rows from a dataframe into the Redshift DB via pd.to_sql(). As a workaround I download the DF as a CSV, push the file to S3, and then use COPY to write into the DB.
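
Roughly, that workaround looks like the sketch below (simplified; the bucket name, key, and IAM role are placeholders, and conn is the raw psycopg2 connection created in the code further down):

import boto3

# dump the dataframe to a local CSV
csv_path = '/tmp/table_name.csv'
df.to_csv(csv_path, index=False, header=False)

# push the file to S3
s3 = boto3.client('s3')
s3.upload_file(csv_path, 'my-bucket', 'staging/table_name.csv')

# COPY from S3 into Redshift and commit
copy_sql = """
    COPY table_name
    FROM 's3://my-bucket/staging/table_name.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    CSV
"""
with conn.cursor() as cur:
    cur.execute(copy_sql)
conn.commit()

This works, but the extra round trip through disk and S3 is what I'd like to avoid.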

The fast_executemany solution as per Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC would have been perfect, however this is not supported in psycopg2. I also found d6tstack as per https://github.com/d6t/d6tstack/blob/master/examples-sql.ipynb, but pd_to_psql doesn't work for Redshift, only PostgreSQL (Redshift doesn't support COPY ... FROM STDIN).

Any alternatives I can use for my case?

This is my code:

import sqlalchemy as sa

DATABASE = ""
USER = ""
PASSWORD = ""
HOST = "...us-east-1.redshift.amazonaws.com"
PORT = "5439"
SCHEMA = "public"

# build the SQLAlchemy engine for the Redshift cluster
server = "redshift+psycopg2://%s:%s@%s:%s/%s" % (USER, PASSWORD, HOST, PORT, DATABASE)
engine = sa.create_engine(server)
conn = engine.raw_connection()

# empty the target table before reloading it
with conn.cursor() as cur:
    cur.execute('truncate table_name')

# df is the dataframe built earlier; this is the step that takes over 10 minutes
df.to_sql('table_name', engine, index=False, if_exists='append')
user8834780
  • What version of Pandas are you using? Also, your workaround seems to be [the preferred method](https://docs.aws.amazon.com/redshift/latest/dg/t_Loading_data.html). Is it also slow? [This](https://stackoverflow.com/questions/25454477/amazon-redshift-bulk-insert-vs-copying-from-s3) seems like it's related to your issue at hand, at least somewhat. – Ilja Everilä Dec 11 '18 at 20:19
  • @IljaEverilä Pandas v 0.23.4. The COPY is quick, but it seems unnecessary to download the DF as a CSV, upload it to S3, and then copy it into Redshift. Hence I'm looking for a direct approach: DF to SQL – user8834780 Dec 11 '18 at 20:30
  • Brought up the version because [Pandas version 0.23.0, and 0.24.0 and up use multi values inserts](https://stackoverflow.com/questions/50689082/to-sql-pyodbc-count-field-incorrect-or-syntax-error) with dialects that support them. That may improve insert performance in some situations. SQLAlchemy can also use the Psycopg2 extras for executemany style inserts in new versions, if [`use_batch_mode=True`](https://docs.sqlalchemy.org/en/latest/dialects/postgresql.html#psycopg2-batch-mode-fast-execution) is passed to `create_engine()`. – Ilja Everilä Dec 11 '18 at 21:07
  • The sqlalchemy docs don't seem to be working at the moment. I will check out use_batch_mode when they do, since I am not sure how to implement it – user8834780 Dec 11 '18 at 22:08
  • @IljaEverilä I still can't seem to access the docs. Would you be able to write an answer for this? – user8834780 Dec 14 '18 at 17:20

1 Answer


If you are unable to use COPY from S3 and must rely on DML, you could try passing use_batch_mode=True to create_engine():

from sqlalchemy import create_engine

bm_engine = create_engine('theurl', use_batch_mode=True)

From this machine, a simple insert of 500 rows to a Redshift cluster showed a reasonable improvement with batch mode enabled (engine below is a plain engine, bm_engine the batch-mode engine created above):

In [31]: df = pd.DataFrame({'batchno': range(500)})

In [32]: %time df.to_sql('batch', engine, index=False, if_exists='append')
CPU times: user 87.8 ms, sys: 57.6 ms, total: 145 ms
Wall time: 1min 6s

In [33]: %time df.to_sql('batch', bm_engine, index=False, if_exists='append')
CPU times: user 10.3 ms, sys: 4.66 ms, total: 15 ms
Wall time: 9.96 s

Note that Pandas versions 0.23.0, and 0.24.0 and up will not benefit from using batch mode, since they use a multi-values insert instead of executemany when the underlying dialect supports it. Using a multi-values insert should provide somewhat similar throughput improvements, since fewer queries are issued.
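
For Pandas 0.24.0 and up you can also request the multi-values insert explicitly through the method argument of to_sql. A rough sketch, reusing the df and engine from above; the chunksize value is only an example and caps how many rows go into each INSERT statement:

# pandas >= 0.24: emit multi-row INSERT ... VALUES statements,
# at most 1000 rows per statement
df.to_sql('batch', engine, index=False, if_exists='append',
          method='multi', chunksize=1000)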

Ilja Everilä
  • Thank you! Would you say there is any benefit for using chunksize? or let it run as is? – user8834780 Dec 17 '18 at 17:54
  • I suppose that for versions of Pandas that use executemany, and so benefit from `use_batch_mode`, chunksize should not matter, because SQLA also splits the values into batches. But with versions that use a multi-values insert you could try different chunksizes. – Ilja Everilä Dec 17 '18 at 20:02
  • Thanks a lot for this answer! Although this note doesn't seem to hold for me: "Note that Pandas versions 0.23.0, and 0.24.0 and up will not benefit from using batch mode". I am using Pandas 0.24.0 and had a pd.DataFrame.to_sql() to pg hanging for over 20 minutes without `use_batch_mode=True`, and it just now finished in 1 second after adding `use_batch_mode=True` as suggested and changing nothing else. – Max Power Apr 10 '19 at 18:58