
I have written a program using PySpark to connect to an Oracle database and fetch data. The command below works fine and returns the contents of the table:

sqlContext.read.format("jdbc") \
    .option("url", "jdbc:oracle:thin:user/password@dbserver:port/dbname") \
    .option("dbtable", "SCHEMA.TABLE") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load().show()

Now I do not want to load the entire table; I only want to load selected records. Can I specify a SELECT query as part of this command? If yes, how?

Note: I know I could load the full table into a DataFrame and run a select/filter on top of it, but I do not want to do that. Please help!

SANKALP VERMA

2 Answers


You can use a subquery in the dbtable option:

.option("dbtable", "(SELECT * FROM tableName) AS tmp where x = 1")

Here is a similar question, but about MySQL.
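
For example, here is a minimal sketch of the full PySpark call with the subquery passed through dbtable (the connection string, SCHEMA.TABLE and the column x are placeholders carried over from the question and the snippet above, not a tested setup):

# The filtering happens inside Oracle, so only matching rows are fetched.
query = "(SELECT * FROM SCHEMA.TABLE WHERE x = 1) tmp"

sqlContext.read.format("jdbc") \
    .option("url", "jdbc:oracle:thin:user/password@dbserver:port/dbname") \
    .option("dbtable", query) \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load() \
    .show()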

T. Gawęda

In general, the optimizer SHOULD be able to push down any relevant select and where elements, so if you do df.select("a", "b", "c").where("d < 10"), this should in general be pushed down to Oracle. You can check it by calling df.explain(True) on the final DataFrame.
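
A rough sketch of that check, reusing the reader from the question (connection details and the column names a, b, c, d are placeholders); the physical plan printed by explain should show the pushed projection and filter on the JDBC scan:

df = sqlContext.read.format("jdbc") \
    .option("url", "jdbc:oracle:thin:user/password@dbserver:port/dbname") \
    .option("dbtable", "SCHEMA.TABLE") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()

# Projection and filter that Spark can push down to the database.
result = df.select("a", "b", "c").where("d < 10")

# Prints the logical and physical plans; look for the pushed filter
# on the JDBC relation in the physical plan.
result.explain(True)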

Assaf Mendelson