
I am reading table data from SQL Server and storing it as a DataFrame in Spark, and I want to write the DataFrame back to a Parquet file in S3. The table has around 30 columns and 204 million rows, and right now a plain spark.write takes around 25-30 minutes to write it to S3 as Parquet. Is there any faster way to do the same?

Edit-1: Please also suggest a better approach to this, if you have one, instead of a direct spark.write.
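
For reference, a minimal sketch of the kind of pipeline described above, assuming a plain JDBC read followed by a direct Parquet write to S3. Host, database, table, credentials and bucket names are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical baseline matching the question: one JDBC connection reads the whole
// table, then the DataFrame is written straight to S3 as Parquet.
// Assumes the SQL Server JDBC driver and hadoop-aws (s3a) are on the classpath.
val spark = SparkSession.builder().appName("sqlserver-to-s3").getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=mydb")
  .option("dbtable", "dbo.big_table")
  .option("user", "spark_user")
  .option("password", "secret")
  .load()                                        // single connection, single partition

df.write
  .mode("overwrite")
  .parquet("s3a://my-bucket/exports/big_table/") // direct write to S3
```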

Sundeep Pidugu
  • You could try increasing the number of partitions when writing; depending on your setup it could be faster (see the sketch after these comments). – Shaido Oct 31 '18 at 09:51
  • What could be the optimum partition number, and is there any other approach to do the same? – Sundeep Pidugu Oct 31 '18 at 10:01
  • Hard to say, maybe this can help get an idea: https://stackoverflow.com/questions/39381041/determining-optimal-number-of-spark-partitions-based-on-workers-cores-and-dataf – Shaido Oct 31 '18 at 10:02
  • Also, the limiting factor might be the source SQL Server: if you open far too many connections to it (i.e. partitions), it might not have enough resources to release data fast enough, and most of your executors might just be idling. If you take transactions on the source SQL Server into account, the connections might also be bogged down by transaction locks. You can run a series of experiments: first run with just one executor and time it; if it takes more than 25 minutes, increase the number of executors and compare the times; once they stop decreasing, you have hit the DB limit. Also check the DB stats. – alexeipab Oct 31 '18 at 10:25
  • This could be a classical case of the ["Parquet Tax"](https://arnon.me/2015/08/spark-parquet-s3/); I've stumbled upon it earlier with `Spark`. To overcome it, I had to resort to writing the data to `HDFS` and then `distcp`ing it to `S3`. Also see [`Spark DirectOutputCommitter`](https://stackoverflow.com/questions/39630120/). – y2k-shubham Oct 31 '18 at 11:18
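
Putting the comments together, here is a sketch of the tuning they suggest: parallelise the JDBC read with partitionColumn/numPartitions (kept modest so SQL Server is not overwhelmed by concurrent connections), and control the number of write tasks and output files with repartition() before the Parquet write. The partition column, bounds, partition counts and fetch size below are assumptions that would need to be tuned against the source database and cluster size:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: column name, bounds, partition counts and fetch size are assumptions.
// S3A credentials/committer settings are assumed to be configured on the cluster.
val spark = SparkSession.builder().appName("sqlserver-to-s3").getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=mydb")
  .option("dbtable", "dbo.big_table")
  .option("user", "spark_user")
  .option("password", "secret")
  .option("partitionColumn", "id")        // assumed numeric, roughly uniform key
  .option("lowerBound", "1")
  .option("upperBound", "204000000")      // ~ row count of the table
  .option("numPartitions", "16")          // = concurrent connections to SQL Server
  .option("fetchsize", "10000")           // rows pulled per JDBC round trip
  .load()

df.repartition(64)                        // parallel write tasks / output Parquet files
  .write
  .mode("overwrite")
  .parquet("s3a://my-bucket/exports/big_table/")
```

If the write itself (rather than the JDBC read) turns out to be the bottleneck, the last comment's approach of writing to HDFS first and then copying to S3 with distcp, or using an S3-optimised output committer, is the usual workaround for the rename-heavy "Parquet Tax".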

0 Answers