I'm trying to improve the performance of my code using spark_session.read.format("jdbc"). I have two different approaches:
Using a JDBC connection to an Oracle table, like this:
    testDF = spark_session.read.format("jdbc") \
        .option("url", <url>) \
        .option("dbtable", "(SELECT distinct field, (ROW_NUMBER() OVER(ORDER BY field)) RNO FROM <table> WHERE <CONDITION>)") \
        .option("user", <user>) \
        .option("password", <pass>) \
        .option("numPartitions", 100) \
        .option("partitionColumn", "RNO") \
        .option("lowerBound", 1) \
        .option("upperBound", 10) \
        .load()
    print(testDF.count())
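As a sanity check on the snippet above, I also look at how the read is actually split (as far as I understand, each partition maps to one query against Oracle with a predicate on RNO):

    # Number of partitions Spark actually created for the JDBC read
    print(testDF.rdd.getNumPartitions())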
Using the Oracle connection directly, running the query, and then converting the result into a PySpark DataFrame:
    import pandas

    # connection is an already-open Oracle connection (e.g. from cx_Oracle)
    pandasDF = pandas.read_sql("SELECT distinct field FROM <table> WHERE <CONDITION>", con=connection)
    testDF = spark_session.createDataFrame(pandasDF)
    print(testDF.count())
I have tried the first approach with different values for these parameters, but the first one always takes up to an hour to finish, while the second approach takes only a few minutes, sometimes as little as 6.
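This is roughly how I time both variants (a minimal sketch; the count() just forces the full read):

    import time

    start = time.time()
    print(testDF.count())  # count() forces the whole table to be fetched
    print(f"elapsed: {time.time() - start:.1f}s")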
I also want to join this table with another one, and there I see the same gap between the two approaches. I don't understand why the first one is so slow; I have also tried persisting it, without success.
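The join and the persist attempt look roughly like this (otherDF and the join key field here are placeholders, not my real names):

    from pyspark import StorageLevel

    # otherDF is the second table, loaded the same way as testDF
    joinedDF = testDF.join(otherDF, on="field", how="inner")

    # Persisting before reuse, which did not help with the first approach
    joinedDF.persist(StorageLevel.MEMORY_AND_DISK)
    print(joinedDF.count())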
Can someone help me understand this and solve the issue, so that I can use the first approach rather than the second?