
I am trying to get data from Oracle into a Spark dataframe on Databricks. However, I noticed the query is taking a long time to run when I trigger it from Databricks (2.5 hours). The same query takes less than 30 minutes when I run it in SQL Developer. Can someone please help me understand why it takes longer to run the query through Databricks than SQL Developer?

```python
oracle_df = spark.read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", 'query') \
    .option("user", "user") \
    .option("password", "password") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("fetchsize", 10000) \
    .option("numPartitions", 5) \
    .load()
```
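For reference, a minimal sketch of the two forms Spark's JDBC reader accepts for a SQL statement: the `query` option, or a parenthesized subquery with an alias in `dbtable`. The SQL text, column names, and connection handling below are placeholders, not the actual query from the question:

```python
# Hypothetical query used only for illustration; substitute the real CTE query.
sql = "SELECT dept_id, COUNT(*) AS headcount FROM employees GROUP BY dept_id"

# Subquery form for the `dbtable` option: must be parenthesized and aliased.
dbtable_value = f"({sql}) q"

def read_query(spark, jdbc_url):
    # Assumes a Databricks notebook where `spark` and a valid `jdbc_url` exist.
    return (spark.read.format("jdbc")
            .option("url", jdbc_url)
            .option("user", "user")
            .option("password", "password")
            .option("driver", "oracle.jdbc.driver.OracleDriver")
            .option("query", sql)       # pushes the full statement down to Oracle
            .option("fetchsize", 10000)
            .load())
```

Either way, Oracle executes the statement itself; Spark only streams the result set back over JDBC.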
Bob
  • JDBC drivers tend to be slow and require some trial and error before you can get the right set of parameters for the intended gain. Additionally, you can go through this to see if it helps: https://docs.databricks.com/data/data-sources/sql-databases.html#optimize-performance-when-reading-data – Vaebhav Apr 15 '22 at 05:41
  • I would suggest having an intermediary sink of your Oracle DB objects in a cloud bucket, which you should read as your dataframe and then perform the final analysis on – Vaebhav Apr 15 '22 at 05:42
  • Does this answer your question? [Why spark is slower when compared to sqoop , when it comes to jdbc?](https://stackoverflow.com/questions/56288481/why-spark-is-slower-when-compared-to-sqoop-when-it-comes-to-jdbc) – Alex Ott Apr 15 '22 at 08:07
  • @Vaebhav, the query here is a complex SQL with a series of CTE expressions (legacy code); I do not have an {id} to use for upperBound and lowerBound. Also, the query results in only 100 records. Would modifying the fetchsize make any difference here? – Bob Apr 15 '22 at 16:30
  • If I understand this correctly, you are trying to run an already-written query via Spark as an environment, and not rewriting your query in Spark SQL or equivalent Scala or Python code, right? – Vaebhav Apr 15 '22 at 16:48
  • Yes, the query is running in Oracle using JDBC, not on Spark. – Bob Apr 16 '22 at 00:19
  • @Vaebhav is there any other way to optimize this process? – Bob Apr 17 '22 at 06:09
  • Based on your reply, you are not fully utilising the power of Spark, as you are just delegating the work to your Oracle host. Hence I would recommend you rewrite your data transformations natively in Spark – Vaebhav Apr 17 '22 at 06:58
  • Since your performance depends mostly on your DB object table structure and its indexes, the other option is to optimise your current queries natively in Oracle and forgo Spark altogether – Vaebhav Apr 17 '22 at 06:59
  • Yes, I understood that, but do you know why the same query runs in 30 minutes in SQL Developer and takes 2.5 hours when I trigger it from Spark? – Bob Apr 18 '22 at 00:53
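The tuning the comments allude to can be sketched as follows. Note that on a read, `numPartitions` only takes effect when `partitionColumn`, `lowerBound`, and `upperBound` are also supplied; with `numPartitions` alone (as in the question), Spark reads through a single JDBC connection. The column name and bounds below are assumptions for illustration:

```python
# Sketch only: `emp_id` and the bounds are hypothetical; in the question there
# is no numeric column available, so this approach would not apply directly.
partition_opts = {
    "partitionColumn": "emp_id",  # must be a numeric, date, or timestamp column
    "lowerBound": "1",
    "upperBound": "1000000",
    "numPartitions": "5",
}

def read_partitioned(spark, jdbc_url):
    # Assumes a Databricks notebook where `spark` and a valid `jdbc_url` exist.
    # With all four options set, Spark opens 5 parallel JDBC connections, each
    # fetching one range of emp_id values; otherwise it reads serially.
    return (spark.read.format("jdbc")
            .option("url", jdbc_url)
            .option("user", "user")
            .option("password", "password")
            .option("driver", "oracle.jdbc.driver.OracleDriver")
            .option("dbtable", "employees")
            .option("fetchsize", 10000)
            .options(**partition_opts)
            .load())
```

For a complex CTE query returning only ~100 rows, the transfer itself is cheap, so the gap is more likely in how Oracle plans the statement arriving over JDBC than in Spark-side fetch settings.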

0 Answers