I am trying to write a PySpark DataFrame of ~3 million rows x 158 columns (~3 GB) to TimescaleDB.
The write operation is executed from a Jupyter kernel with the following resources:
- 1 driver, 2 vCPUs, 2 GB memory
- 2 executors, 2 vCPUs, 4 GB memory
As one could expect, it is fairly slow.
I know of repartition and batchsize, so I am trying to play with those parameters to speed up the write operation, but I was wondering what the optimal values would be to make it as performant as possible.
df.rdd.getNumPartitions() is 7; should I try to increase or decrease the number of partitions? I've tried playing with it a bit, but I did not get any conclusive result. Increasing the number of partitions does seem to slow the writing, but that might just be because Spark performs the repartition first.
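As I understand it, each partition is written over its own JDBC connection, so I have also tried matching the partition count to the executor cores explicitly, roughly like the sketch below (the count of 4 is just a guess based on the 2 executors x 2 vCPUs, the variable names are placeholders, and caching first is meant to keep the repartition cost out of the write timing):

df_out = df.repartition(4).cache()  # 4 = total executor cores, as a first guess
df_out.count()  # materialize the repartition so only the write is measured below

df_out.write \
    .mode("overwrite") \
    .format('jdbc') \
    .option('url', 'my_url') \
    .option('user', 'my_user') \
    .option('password', 'my_pwd') \
    .option('dbtable', 'my_table') \
    .option('batchsize', '10000') \
    .save()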
I am more specifically wondering about batchsize. I guess the optimal batchsize depends on the TimescaleDB/PostgreSQL configuration, but I haven't been able to find more information about this.
For the record, here is an example of what I've tried:
df.write \
    .mode("overwrite") \
    .format('jdbc') \
    .option('url', 'my_url') \
    .option('user', 'my_user') \
    .option('password', 'my_pwd') \
    .option('dbtable', 'my_table') \
    .option('numPartitions', '5') \
    .option('batchsize', '10000') \
    .save()
This took 26 minutes on a much smaller sample of the DataFrame (~500K rows, 500 MB).
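For reference, this is roughly how I am comparing batchsize values on that sample (the candidate values and the test table name are arbitrary):

import time

sample = df.limit(500000).cache()  # roughly the 500K-row sample mentioned above
sample.count()  # materialize the sample so only the writes are timed

for bs in [1000, 10000, 50000, 100000]:
    start = time.time()
    sample.write \
        .mode("overwrite") \
        .format('jdbc') \
        .option('url', 'my_url') \
        .option('user', 'my_user') \
        .option('password', 'my_pwd') \
        .option('dbtable', 'my_test_table') \
        .option('batchsize', str(bs)) \
        .save()
    print(bs, round(time.time() - start, 1), "seconds")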
We are aware our Jupyter kernel is lacking in resources and are trying to work on that too, but is there a way to optimize the write speed with Spark and TimescaleDB parameters?
[EDIT] I have also read this very helpful answer about using COPY, but we are specifically looking for ways to increase performance using Spark for now.