
I am trying to write a PySpark DataFrame of ~3 million rows x 158 columns (~3 GB) to TimescaleDB.

The write operation is executed from a Jupyter kernel with the following resources:

  • 1 driver: 2 vCPUs, 2 GB memory
  • 2 executors: 2 vCPUs, 4 GB memory

As one could expect, it is fairly slow.

I know about repartition and batchsize, so I am playing with those parameters to speed up the write operation, but I am wondering what the optimal settings would be to get the best performance.

df.rdd.getNumPartitions() is 7; should I try to increase or decrease the number of partitions? I've tried playing with it a bit but did not get any conclusive result. Increasing the number of partitions does seem to slow the write down, but that might just be because Spark performs the repartition first.
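For clarity, by changing the number of partitions I mean an explicit repartition of the DataFrame before the write (the count below is just an example), as opposed to the numPartitions JDBC option shown further down:

# Explicit repartition (full shuffle) before the JDBC write;
# coalesce() avoids the shuffle when only reducing the partition count
df = df.repartition(4)
print(df.rdd.getNumPartitions())  # 4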

More specifically, I am wondering about batchsize. I guess the optimal batchsize depends on the TimescaleDB/PostgreSQL configuration, but I haven't been able to find more information about this.

For the record, here is an example of what I've tried:

df.write \
  .mode("overwrite") \
  .format('jdbc') \
  .option('url', 'my_url') \
  .option('user', 'my_user') \
  .option('password', 'my_pwd') \
  .option('dbtable', 'my_table') \
  .option('numPartitions', '5') \
  .option('batchsize', '10000') \
  .save()

This took 26 minutes on a much smaller sample of the DataFrame (~500K rows, ~500 MB).

We are aware that our Jupyter kernel is lacking in resources and are trying to work on that too, but is there a way to optimize the writing speed with Spark and TimescaleDB parameters?

[EDIT] I have also read this very helpful answer about using COPY, but we are specifically looking for ways to increase performance using Spark for now.
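For reference, that COPY-based approach from within Spark would look roughly like this (a sketch using psycopg2 inside foreachPartition; the connection details and table name are placeholders, and psycopg2 would need to be available on the executors):

import csv
import io

import psycopg2

def copy_partition(rows):
    # Buffer the partition as CSV in memory, then stream it with COPY
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow(row)
    buf.seek(0)
    conn = psycopg2.connect(host='my_host', dbname='my_db',
                            user='my_user', password='my_pwd')
    try:
        with conn.cursor() as cur:
            cur.copy_expert("COPY my_table FROM STDIN WITH (FORMAT csv)", buf)
        conn.commit()
    finally:
        conn.close()

df.rdd.foreachPartition(copy_partition)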

    Some advice concerning the DB: write to a new table that hasn't got a single constraint or index, add these when you have finished writing. – Gerard H. Pille Jan 18 '21 at 13:16
  • Thanks, I had indeed read about that. I'm overwriting the table on each write for now, and have not set any index or other constraint, so it should not be a problem, but we'll have to keep it in mind once we start to append data to the table. – Flxnt Jan 18 '21 at 13:20
  • Don't forget that Spark isn't suited to work with a database that is shared with other applications. I would always write to a new table, and have a stored procedure do what's necessary afterwards. – Gerard H. Pille Jan 18 '21 at 13:45
  • You can use Spark along with the COPY command to make it more efficient and performant. – Nikunj Kakadiya Feb 07 '21 at 06:14

1 Answer


If it's using JDBC, the reWriteBatchedInserts=true parameter, which was introduced a while back (https://jdbc.postgresql.org/documentation/changelog.html#version_9.4.1209), will likely speed things up significantly. It should be possible to add it directly to the connection string, or there may be a way to specify it through the Spark connector.
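For example, a minimal sketch of the connection-string approach (URL, credentials and table name are placeholders):

df.write \
  .mode("overwrite") \
  .format('jdbc') \
  .option('url', 'jdbc:postgresql://my_host:5432/my_db?reWriteBatchedInserts=true') \
  .option('user', 'my_user') \
  .option('password', 'my_pwd') \
  .option('dbtable', 'my_table') \
  .option('batchsize', '10000') \
  .save()

As far as I know, Spark also forwards options it does not recognise to the JDBC driver as connection properties, so .option('reWriteBatchedInserts', 'true') should have the same effect.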

davidk