
I am currently using SQLAlchemy to write a pandas DataFrame to a PostgreSQL database on an AWS server. My code looks like this:

import pickle
from sqlalchemy import create_engine

# Build the connection URL as postgresql://user:password@host:port/database.
engine = create_engine(
    'postgresql://{}:{}@{}:{}/{}'.format(ModelData.user, ModelData.password, ModelData.host,
                                         ModelData.port, ModelData.database), echo=True)

# Load the pickled DataFrame and write it to PostgreSQL in 1,000-row chunks.
with open(file, 'rb') as f:
    df = pickle.load(f)
df.to_sql(table_name, engine, method='multi', if_exists='replace', index=False, chunksize=1000)

The table I am writing has about 900,000 rows and 500 columns, and it takes quite a long time to complete; sometimes I will wait all day and it still won't be finished. Is there a faster way to write this data? To reiterate, this question is about speed, not about whether the code executes. Any help would be appreciated!

Note: The machine I'm using has 32 GB of RAM, an i7 processor, 1 TB of storage, and a GPU, so I don't think the machine itself is the bottleneck.


1 Answer


Have you played with the chunksize parameter?
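For context, with method='multi' each chunk becomes a single INSERT statement carrying (rows per chunk) x (columns) bind parameters, which at 1000 x 500 is 500,000 values per statement here, so both smaller and larger values are worth timing. A minimal sketch of the experiment (100 is just an illustrative starting point, not a recommendation):

# With 500 columns, chunksize=100 keeps each multi-row INSERT at
# 100 x 500 = 50,000 bind parameters; time a few values and compare.
df.to_sql(table_name, engine, method='multi', if_exists='replace',
          index=False, chunksize=100)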

With a dataset that large you may need to write it out to a tab-delimited file, transfer it to your EC2 instance, and then use the \copy command in psql to get it done in a reasonable amount of time.
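A rough sketch of that route from Python, assuming the target table already exists and using psycopg2's copy_expert(), which runs the same client-side COPY ... FROM STDIN that psql's \copy wraps; the file name model_data.tsv is just a placeholder:

import psycopg2

# 1. Dump the DataFrame to a tab-delimited file (header kept, index dropped).
df.to_csv('model_data.tsv', sep='\t', index=False)

# 2. Stream the file into the table with a client-side COPY.
conn = psycopg2.connect(host=ModelData.host, port=ModelData.port,
                        dbname=ModelData.database, user=ModelData.user,
                        password=ModelData.password)
with conn, conn.cursor() as cur, open('model_data.tsv') as f:
    cur.copy_expert(
        "COPY {} FROM STDIN WITH (FORMAT csv, DELIMITER E'\\t', HEADER true)"
        .format(table_name),
        f,
    )
conn.close()

On a table this wide, COPY typically beats multi-row INSERTs by a wide margin because the server parses one data stream instead of millions of bind parameters.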

Since it is RDS instead of EC2, I checked, and it looks like RDS PostgreSQL supports COPY from S3: https://www.postgresql.org/docs/current/sql-copy.html

This will mean some work setting up the table yourself, but the time savings may make it worth it. If you can take a small subset of your DataFrame and have DataFrame.to_sql() use it to create the table for you, you can then upload your tab-separated file to S3 and COPY it into that table.
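A rough sketch of that two-step idea, assuming the aws_s3 extension is enabled on the RDS instance and that the instance has been granted access to the bucket; the bucket name, key, and region below are placeholders, and the import goes through aws_s3.table_import_from_s3 rather than plain COPY:

from sqlalchemy import text

# Step 1: create the 500-column table from an empty slice of the DataFrame,
# so pandas sets up the schema without inserting any rows.
df.head(0).to_sql(table_name, engine, if_exists='replace', index=False)

# Step 2: import the tab-separated file that was uploaded to S3.
# Requires the extension on the RDS instance: CREATE EXTENSION aws_s3 CASCADE;
import_sql = text("""
    SELECT aws_s3.table_import_from_s3(
        :table,      -- target table
        '',          -- empty column list = all columns
        :options,    -- options passed to the server-side COPY
        aws_commons.create_s3_uri('my-bucket', 'model_data.tsv', 'us-east-1')
    )
""")
with engine.begin() as conn:
    conn.execute(import_sql, {
        'table': table_name,
        'options': "(FORMAT csv, DELIMITER E'\\t', HEADER true)",
    })

Creating the table from df.head(0) keeps pandas' type inference for the 500 columns while leaving all 900,000 rows to the bulk import.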

Mike Organek