I'm trying to improve the performance of my code using spark_session.read.format("jdbc"). I have two different approaches:
Using a JDBC connection to an Oracle table, like this:
    testDF = spark_session.read.format("jdbc") \
        .option("url", <url>) \
        .option("dbtable", "(SELECT distinct field, (ROW_NUMBER() OVER(ORDER BY field)) RNO FROM <table> WHERE <CONDITION>)") \
        .option("user", <user>) \
        .option("password", <pass>) \
        .option("numPartitions", 100) \
        .option("partitionColumn", "RNO") \
        .option("lowerBound", 1) \
        .option("upperBound", 10) \
        .load()
    print(testDF.count())
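As a sanity check on the snippet above, I also look at how the read is actually split (as far as I understand, each partition maps to one query against Oracle with a predicate on RNO):

    # Number of partitions Spark actually created for the JDBC read
    print(testDF.rdd.getNumPartitions())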
Using the Oracle connection directly, running the query, and then converting the result into a PySpark DataFrame:
    import pandas

    # connection is an already-open Oracle connection (e.g. from cx_Oracle)
    pandasDF = pandas.read_sql("SELECT distinct field FROM <table> WHERE <CONDITION>", con=connection)
    testDF = spark_session.createDataFrame(pandasDF)
    print(testDF.count())
I have tried the first approach with different values for these parameters, but the first one always takes up to an hour to finish, while the second approach takes only a few minutes, sometimes as little as 6.
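This is roughly how I time both variants (a minimal sketch; the count() just forces the full read):

    import time

    start = time.time()
    print(testDF.count())  # count() forces the whole table to be fetched
    print(f"elapsed: {time.time() - start:.1f}s")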
I also want to join this table with another one, and there I see the same gap between the two approaches. I don't understand why the first one is so slow; I have also tried persisting it, without success.
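The join and the persist attempt look roughly like this (otherDF and the join key field here are placeholders, not my real names):

    from pyspark import StorageLevel

    # otherDF is the second table, loaded the same way as testDF
    joinedDF = testDF.join(otherDF, on="field", how="inner")

    # Persisting before reuse, which did not help with the first approach
    joinedDF.persist(StorageLevel.MEMORY_AND_DISK)
    print(joinedDF.count())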
Can someone help me understand this and solve the issue, so that I can use the first approach rather than the second?