Optimal parameters to speed spark df.write to PostgreSQL

Question

I am trying to write a Pyspark dataframe of ~3 millions rows x 158 columns (~3GB) to TimeScale DB.

The write operation is executed from a Jupyter Kernel with the following ressources :

1 Driver, 2 vcpu, 2GB memory
2 Executors, 2 vcpu, 4GB memory

As one could expect, it is fairly slow.

I know of repartition and batchsize, so I am trying to play with those parameters to speed up the write operation, but I was wondering what would be the optimal parameters to be as performant as possible.

df.rdd.getNumPartitions() is 7, should I try to increase or decrease the number of partitions? I've tried playing with it a bit but I did not get any conclusive result. Increasing the number of partitions does seem to slow the writing, but it might just be because of Spark performing repartition first.

I am more specifically wondering about batchsize. I guess optimal batchsize depends on TimeScale/Postgre config, but I haven't been able to find more info about this.

For the record, here is an example of what I've tried :

df.write \
  .mode("overwrite") \
  .format('jdbc') \
  .option('url', 'my_url') \
  .option('user', 'my_user') \
  .option('password', 'my_pwd') \
  .option('dbtable', 'my_table') \
  .option('numPartitions', '5') \
  .option('batchsize', '10000') \
  .save()

This took 26 minutes on a much smaller sample of the dataframe (~500K rows, 500MB).

We are aware our Jupyter kernel is lacking in resources and are trying to work on that too, but is there a way to optimize the writing speed with Spark and TimeScale parameters?

[EDIT] I have also read this very helpful answer about using COPY, but we are specifically looking for ways to increase performance using Spark for now.

Some advice concerning the DB: write to a new table that hasn't got a single constraint or index, add these when you have finished writing. — Gerard H. Pille, Jan 18 '21 at 13:16
Thanks, I had indeed read about that. I'm overwriting the table on each write for now, and have not set any index or other constraint, so it should not be a problem, but we'll have to keep it in mind once we start to append data to the table. — Flxnt, Jan 18 '21 at 13:20
Don't forget that Spark isn't suited to work with a database that is shared with other applications. I would always write to a new table, and have a stored procedure do what's necessary afterwards. — Gerard H. Pille, Jan 18 '21 at 13:45
You can use spark along with the copy command to make it more efficient and performant — Nikunj Kakadiya, Feb 07 '21 at 06:14

score 0 · Answer 1 · answered Feb 06 '21 at 21:03

If it's using the JDBC, the reWriteBatchedInserts=true parameter, which was introduced a while back https://jdbc.postgresql.org/documentation/changelog.html#version_9.4.1209 will likely speed things up significantly. It should just be able to be introduced to the connection string, or potentially there's a way to specify to use it in the Spark connector.

Optimal parameters to speed spark df.write to PostgreSQL

1 Answers1