I am trying to read data from a Postgres table using Spark. Initially I read the data on a single thread, without using the lowerBound, upperBound, partitionColumn, and numPartitions options. The table is large, around 120 million records, so I decided to read it in parallel using partitionColumn. I am able to read the data, but it takes longer with 12 parallel tasks than it did on a single thread. I cannot figure out how to see the 12 SQL queries that Spark generates to fetch each partition in parallel.
The code that I am using is:
val query = s"(select * from db.testtable) as testquery"
val df = spark.read
  .format("jdbc")
  .option("url", jdbcurl)
  .option("dbtable", query)
  .option("partitionColumn", "transactionbegin")
  .option("numPartitions", 12)
  .option("driver", "org.postgresql.Driver")
  .option("fetchsize", 50000)
  .option("user", "user")
  .option("password", "password")
  .option("lowerBound", "2019-01-01 00:00:00")
  .option("upperBound", "2019-12-31 23:59:00")
  .load()
df.count()
Where and how can I see the 12 parallel queries that are created to read the data on each thread?
I can see in the Spark UI that 12 tasks are created, but I cannot find a way to see the 12 separate queries that are generated to fetch the data in parallel from the Postgres table.
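For reference, my understanding (which I have not been able to confirm, hence this question) is that Spark turns the dbtable subquery into one range-filtered query per partition on transactionbegin, something like the following. The month boundaries below are my guess, purely illustrative:

```sql
-- Assumed shape of the per-partition queries; boundary values are
-- illustrative, not confirmed.
SELECT * FROM (select * from db.testtable) as testquery
  WHERE "transactionbegin" < '2019-02-01 00:00:00';    -- partition 1 (no lower bound)

SELECT * FROM (select * from db.testtable) as testquery
  WHERE "transactionbegin" >= '2019-02-01 00:00:00'
    AND "transactionbegin" <  '2019-03-01 00:00:00';   -- partition 2

-- ... partitions 3 through 11 ...

SELECT * FROM (select * from db.testtable) as testquery
  WHERE "transactionbegin" >= '2019-12-01 00:00:00';   -- partition 12 (no upper bound)
```

If that is right, I could probably see the actual statements on the Postgres side by enabling statement logging (log_statement = 'all') or watching the query column of pg_stat_activity while the job runs, but I would like to see them from the Spark side if possible.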
Is there any way I can push the filter down so that it reads only this year's worth of data, in this case 2019?
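One option I am considering (just a sketch based on my current code, not something I have verified is the right approach) is to put the 2019 filter into the subquery itself, so that every partition query only touches this year's rows:

```scala
// Sketch: embed the year filter in the subquery passed as "dbtable",
// so each partition query inherits the WHERE clause. Same table,
// column, and options as in my code above.
val query =
  """(select * from db.testtable
    | where transactionbegin >= '2019-01-01 00:00:00'
    |   and transactionbegin <  '2020-01-01 00:00:00') as testquery""".stripMargin

val df = spark.read
  .format("jdbc")
  .option("url", jdbcurl)
  .option("dbtable", query)
  .option("partitionColumn", "transactionbegin")
  .option("numPartitions", 12)
  .option("driver", "org.postgresql.Driver")
  .option("fetchsize", 50000)
  .option("user", "user")
  .option("password", "password")
  .option("lowerBound", "2019-01-01 00:00:00")
  .option("upperBound", "2019-12-31 23:59:00")
  .load()
```

Would that be enough, or is there a cleaner way to get the filter pushed down to Postgres?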