
I have a very large table containing billions of rows. I tried loading data with the Spark DataFrame API; here is my code:

sql = "select * from mytable where day = '2016-11-25' and hour = 10"
df = sqlContext.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", table) \
    .load(sql)

df.show()

I ran the same SQL in the MySQL client and it returns about 100 rows, but the code above fails in Spark with an OOM error. It seems that Spark loads the entire table into memory without applying the WHERE clause. How can I get Spark SQL to push the WHERE clause down to the database?

wxweven
  • try running this: `df = sqlContext.read.format("jdbc").option("driver", driver).option("url", url).option("user", user).option("password", password).option("dbtable", table).load().where("day = '2016-11-25' and hour = 10")` – Shivansh Nov 25 '16 at 08:47
  • Possible duplicate of [In Apache Spark 2.0.0, is it possible to fetch a query from an external database (rather than grab the whole table)?](http://stackoverflow.com/questions/38729436/in-apache-spark-2-0-0-is-it-possible-to-fetch-a-query-from-an-external-database) –  Nov 25 '16 at 09:11

2 Answers


I have solved the problem. The Spark documentation for the JDBC data source gives the answer: the `dbtable` option accepts anything that is valid in a FROM clause, so a subquery in parentheses can be used in place of a table name.

So the key is to change the `dbtable` option: make your SQL a subquery. The correct code is:

# 1. write your query as a subquery with a table alias
sql = "(select * from mytable where day = '2016-11-25' and hour = 10) t1"
df = sqlContext.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", sql) \
    .load()  # 2. pass the subquery via "dbtable"; load() takes no query argument

df.show()
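As a side note, the wrapping step above can be captured in a small helper (a sketch for illustration; `as_subquery` is a made-up name, not a Spark API), since every query passed through `dbtable` needs the parentheses and a table alias:

```python
def as_subquery(sql, alias="t1"):
    # Wrap a raw SQL query so it is valid as the JDBC "dbtable" option:
    # the database then treats "(query) alias" as a derived table.
    return "({}) {}".format(sql, alias)

query = as_subquery("select * from mytable where day = '2016-11-25' and hour = 10")
# query == "(select * from mytable where day = '2016-11-25' and hour = 10) t1"
# Then: sqlContext.read.format("jdbc").option("dbtable", query)...  .load()
```

Newer Spark versions (2.4+) also accept the query directly through a `query` option, which does the same wrapping internally.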
wxweven
sql = "(select * from mytable where day = '2016-11-25' and hour = 10) as t1"
df = sqlContext.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", sql) \
    .load()

df.show()
  • While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. – Piotr Labunski Mar 23 '20 at 08:57
  • @PiotrLabunski Here Spark SQL executes the query as a derived table aliased "t1" and loads only the rows matching the where clause into the DataFrame. Hope this helps. – Mangesh.K Mar 25 '20 at 09:28
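For a table with billions of rows, it can also help to split the JDBC read across partitions with the `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` options, so no single task fetches the whole result. The sketch below only illustrates the idea of how such a read slices the key range into per-partition WHERE clauses; Spark's actual stride arithmetic lives in its JDBC relation code and handles edge cases differently, and the column name `id` is a placeholder:

```python
def partition_predicates(column, lower, upper, num_partitions):
    # Roughly mirror how a partitioned JDBC read splits the key range:
    # each partition fetches one slice; the first also picks up NULLs,
    # and the last picks up everything from the final stride upward.
    stride = (upper - lower) // num_partitions
    preds, current = [], lower + stride
    for i in range(num_partitions):
        if i == 0:
            preds.append("{} < {} or {} is null".format(column, current, column))
        elif i == num_partitions - 1:
            preds.append("{} >= {}".format(column, current - stride))
        else:
            preds.append("{} <= {} and {} < {}".format(current - stride, column, column, current))
        current += stride
    return preds

# e.g. partition_predicates("id", 0, 100, 4) yields four clauses:
#   "id < 25 or id is null"
#   "25 <= id and id < 50"
#   "50 <= id and id < 75"
#   "id >= 75"
```

In PySpark you would not build these clauses yourself; you would pass the bounds directly, e.g. `.option("partitionColumn", "id").option("lowerBound", 0).option("upperBound", 100).option("numPartitions", 4)`.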