
I am trying to get data from Oracle into a Spark dataframe on Databricks. However, I noticed the query is taking a long time to run when I trigger it from Databricks (2.5 hours). The same query takes less than 30 minutes when I run it in SQL Developer. Can someone please help me understand why it takes longer to run the query through Databricks than SQL Developer?

```python
oracle_df = spark.read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", 'query') \
    .option("user", "user") \
    .option("password", "password") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("fetchsize", 10000) \
    .option("numPartitions", 5) \
    .load()
```
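For reference, a minimal sketch of the two forms Spark's JDBC reader accepts for a SQL statement: the `query` option, or a parenthesized subquery with an alias in `dbtable`. The SQL text, column names, and connection handling below are placeholders, not the actual query from the question:

```python
# Hypothetical query used only for illustration; substitute the real CTE query.
sql = "SELECT dept_id, COUNT(*) AS headcount FROM employees GROUP BY dept_id"

# Subquery form for the `dbtable` option: must be parenthesized and aliased.
dbtable_value = f"({sql}) q"

def read_query(spark, jdbc_url):
    # Assumes a Databricks notebook where `spark` and a valid `jdbc_url` exist.
    return (spark.read.format("jdbc")
            .option("url", jdbc_url)
            .option("user", "user")
            .option("password", "password")
            .option("driver", "oracle.jdbc.driver.OracleDriver")
            .option("query", sql)       # pushes the full statement down to Oracle
            .option("fetchsize", 10000)
            .load())
```

Either way, Oracle executes the statement itself; Spark only streams the result set back over JDBC.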
Bob
  • JDBC drivers tend to be slow and require some trial and error before you can get the right set of parameters for the intended gain. Additionally, you can go through this to see if it helps: https://docs.databricks.com/data/data-sources/sql-databases.html#optimize-performance-when-reading-data – Vaebhav Apr 15 '22 at 05:41
  • I would suggest having an intermediary sink of your Oracle DB objects in a cloud bucket, which you should read as your dataframe and then perform the final analysis on – Vaebhav Apr 15 '22 at 05:42
  • Does this answer your question? [Why spark is slower when compared to sqoop , when it comes to jdbc?](https://stackoverflow.com/questions/56288481/why-spark-is-slower-when-compared-to-sqoop-when-it-comes-to-jdbc) – Alex Ott Apr 15 '22 at 08:07
  • @Vaebhav, the query here is a complex SQL with a series of CTE expressions (legacy code); I do not have an {id} to use for upperBound and lowerBound. Also, the query results in only 100 records. Would modifying the fetchsize make any difference here? – Bob Apr 15 '22 at 16:30
  • If I understand this correctly, you are trying to run an already-written query via Spark as an environment, and not rewriting your query in Spark SQL or equivalent Scala or Python code, right? – Vaebhav Apr 15 '22 at 16:48
  • Yes, the query is running in Oracle using JDBC, not on Spark. – Bob Apr 16 '22 at 00:19
  • @Vaebhav is there any other way to optimize this process? – Bob Apr 17 '22 at 06:09
  • Based on your reply, you are not fully utilising the power of Spark, as you are just delegating the work to your Oracle host. Hence I would recommend you rewrite your data transformations natively in Spark – Vaebhav Apr 17 '22 at 06:58
  • Since your performance depends mostly on your DB object table structure and its indexes, the other option is to optimise your current queries natively in Oracle and forgo Spark altogether – Vaebhav Apr 17 '22 at 06:59
  • Yes, I understood that, but do you know why the same query runs in 30 minutes in SQL Developer and takes 2.5 hours when I trigger it from Spark? – Bob Apr 18 '22 at 00:53
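The tuning the comments allude to can be sketched as follows. Note that on a read, `numPartitions` only takes effect when `partitionColumn`, `lowerBound`, and `upperBound` are also supplied; with `numPartitions` alone (as in the question), Spark reads through a single JDBC connection. The column name and bounds below are assumptions for illustration:

```python
# Sketch only: `emp_id` and the bounds are hypothetical; in the question there
# is no numeric column available, so this approach would not apply directly.
partition_opts = {
    "partitionColumn": "emp_id",  # must be a numeric, date, or timestamp column
    "lowerBound": "1",
    "upperBound": "1000000",
    "numPartitions": "5",
}

def read_partitioned(spark, jdbc_url):
    # Assumes a Databricks notebook where `spark` and a valid `jdbc_url` exist.
    # With all four options set, Spark opens 5 parallel JDBC connections, each
    # fetching one range of emp_id values; otherwise it reads serially.
    return (spark.read.format("jdbc")
            .option("url", jdbc_url)
            .option("user", "user")
            .option("password", "password")
            .option("driver", "oracle.jdbc.driver.OracleDriver")
            .option("dbtable", "employees")
            .option("fetchsize", 10000)
            .options(**partition_opts)
            .load())
```

For a complex CTE query returning only ~100 rows, the transfer itself is cheap, so the gap is more likely in how Oracle plans the statement arriving over JDBC than in Spark-side fetch settings.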

0 Answers