
I am using the Cloudera JDBC Driver for Impala v2.5.38 with Spark 1.6.0 to create a DataFrame. It works fine for all queries except those using a WITH clause, but WITH is used extensively in my organization. Below is my code snippet.

def jdbcHDFS(url: String, sql: String): DataFrame = {
  val jdbcURL = s"jdbc:impala://$url"
  val connectionProperties = new java.util.Properties
  connectionProperties.setProperty("driver", "com.cloudera.impala.jdbc41.Driver")
  // The query is passed as a parenthesised derived table with an alias,
  // since Spark wraps the table argument in a "SELECT * FROM ( )" of its own
  sqlContext.read.jdbc(jdbcURL, s"($sql) AS ST", connectionProperties)
}

Below are examples of working and non-working SQL:

val workingSQL = "select empname from (select * from employee) as tmp"
val nonWorkingSQL = "WITH tmp as (select * from employee) select empname from tmp"
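
For clarity, this is how the function above is invoked for these queries (the host and port are placeholders for the actual Impala endpoint):

val rddDF = jdbcHDFS("impalad_host:21050", nonWorkingSQL)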

Below is the output of rddDF.first for the above SQLs.

For workingSQL

scala> rddDF.first
res8: org.apache.spark.sql.Row = [Kushal]

For nonWorkingSQL

scala> rddDF.first
res8: org.apache.spark.sql.Row = [empname] // Here we expected the actual data, i.e. 'Kushal', instead of the column name, unlike the output of the previous query.

It would be really helpful if anyone could suggest a solution for this.

Please note: both queries work fine in impala-shell as well as in Hive through Hue.

Update: I have tried setting up a plain JDBC connection and executing the nonWorkingSQL, and it worked! I then suspected that the issue was Spark wrapping a "SELECT * FROM ( )" around the query, so I tried the SQL below to isolate the root cause, but it too worked and displayed the expected result.

val sql = "SELECT * FROM (WITH tmp as (select * from employee) select empname from tmp) AS ST"
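
For reference, here is a minimal sketch of the plain JDBC test (the endpoint is a placeholder; only standard java.sql calls are used):

import java.sql.DriverManager

// Same driver class as in the Spark snippet; the endpoint is a placeholder
Class.forName("com.cloudera.impala.jdbc41.Driver")
val conn = DriverManager.getConnection("jdbc:impala://impalad_host:21050")
try {
  val rs = conn.createStatement().executeQuery(sql) // the wrapped WITH query above
  while (rs.next()) {
    println(rs.getString(1)) // prints 'Kushal' as expected, not the column name
  }
} finally {
  conn.close()
}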

Hence, the root cause is not clear and needs further analysis so that this works with Spark as well. Please suggest further.

Arghya Saha
  • Why are you connecting to impala via JDBC? Do you not have a hiveContext configured to access the hive tables directly from spark? Also, what is the count of the df you get in the `nonWorkingSQL` case? – philantrovert Aug 09 '17 at 14:13
  • There are a few issues with hiveContext, and it also takes much more time executing complex SQL consisting of multiple joins. Hence Impala is the strategic solution for this. To answer your question, both SQLs return the same and correct count. – Arghya Saha Aug 09 '17 at 14:28
  • Can you try adding the jdbc connection string argument 'UseNativeQuery=1'? e.g. jdbc:impala://impalad_host:21050;UseNativeQuery=1 – Matt Aug 09 '17 at 15:02
  • @Matt I have already used the same with no luck. – Arghya Saha Aug 09 '17 at 15:25
  • Note that Spark wraps a "SELECT * FROM ( )" around your query -- how does the Impala shell react to that wrapper, with the WITH now inside a sub-query?? – Samson Scharfrichter Aug 09 '17 at 19:11
  • Did you try to set up a plain JDBC connection to Impala (with some custom scala code), run a "CREATE VIEW" with a random name and your custom query, then generate the DataFrame from that view, and finally "DROP VIEW"? – Samson Scharfrichter Aug 09 '17 at 19:17
  • @SamsonScharfrichter Yes, Spark wraps "SELECT * FROM ()", hence I tried to set up a plain JDBC connection and run the following query, and surprisingly it worked! workingSQL = "SELECT * FROM (WITH tmp as (select * from employee) select empname from tmp) AS ST". Now we can't conclude the wrapper is causing the issue. What's your view on it? And regarding the view approach, unfortunately I am supposed to use a read-only user to implement the solution. Please suggest. – Arghya Saha Aug 09 '17 at 19:28
  • @SamsonScharfrichter Is there any way to override the definition of sqlContext.read.jdbc to remove the wrapper in my code? – Arghya Saha Aug 09 '17 at 20:51
  • Back to basics: If it's the WITH syntax that triggers the issue, then don't use it. People have written SQL queries for decades without it. – Samson Scharfrichter Aug 09 '17 at 21:56
  • @SamsonScharfrichter Unfortunately, not using WITH is not an option for me, as I can't restrict the users. Basically, I am working on a framework which takes SQLs as input. – Arghya Saha Aug 10 '17 at 03:33
  • You're accepting all dialects of SQL, or HiveQL only? Or more specifically, Impala's extension of that? – OneCricketeer Aug 26 '17 at 07:04
  • @cricket_007 I am accepting all dialects of SQL; however, it is assumed the SQLs will be compatible with the driver used. – Arghya Saha Aug 26 '17 at 07:11
  • Alright, and have you tried making a JIRA issue for Spark pointing out these issues you've found? The community might benefit from it rather than hacking around the issue – OneCricketeer Aug 26 '17 at 07:18
  • @cricket_007 Agreed, and I will definitely do this, but I have to hack around the issue: even if Spark provides a fix, it will land in a later version, and unfortunately I do not have rights to update Spark in the corporate environment. – Arghya Saha Aug 26 '17 at 07:21
  • You don't need to upgrade the corporate cluster. If you have download rights, you can simply extract and configure Spark 2 locally. Think of your machine as the "edge node" – OneCricketeer Aug 26 '17 at 07:24
  • @cricket_007 Thanks for your suggestion, I will definitely try that, considering I can use YARN as master; however, it's a long process and I will have to migrate the code to Spark 2.x. Meanwhile, any workaround would be a great help! – Arghya Saha Aug 26 '17 at 07:39

0 Answers