
I'm looking for a client JDBC driver that supports Spark SQL.

I have been using Jupyter so far to run SQL statements on Spark (running on HDInsight) and I'd like to be able to connect using JDBC so I can use third-party SQL clients (e.g. SQuirreL, SQL Explorer, etc.) instead of the notebook interface.

I found an ODBC driver from Microsoft, but that doesn't help me with Java-based SQL clients. I also tried downloading the Hive JDBC driver from my cluster, but the Hive JDBC driver does not appear to support the more advanced SQL features that Spark does. For example, the Hive driver complains about join conditions that are not equi-joins, where I know this is a supported feature of Spark because I've executed the same SQL in Jupyter successfully.

OneCricketeer
aaronsteers
  • Questions asking for recommendations or help with finding a library or another off-site resources are off topic. – Mark Rotteveel Jun 09 '16 at 18:42
  • http://www.simba.com/drivers/spark-jdbc-odbc/ Simba’s Apache Spark ODBC and JDBC Drivers efficiently map SQL to Spark SQL by transforming an application’s SQL query into the equivalent form in Spark SQL, enabling direct standard SQL-92 access to Apache Spark distributions. – kliew Jun 10 '16 at 07:06
  • I would try the hive jdbc driver to talk to it. – lockwobr Jun 10 '16 at 14:07
  • @kliew - Simba driver is expensive, and I was hoping for something that's part of the platform. Sounds like this is not available today, and although the Hive driver ships as part of the stack, there is no Spark JDBC driver available in a similar capacity. – aaronsteers Jun 13 '16 at 02:58
  • @lockwobr - Problem with the Hive driver is that it doesn't accept the broader SQL features supported today by Spark. I'm confused why the Hive JDBC driver is included as a downloadable component on the server, but nothing similar exists on the Spark SQL side. Maybe it's just a matter of time?... – aaronsteers Jun 13 '16 at 03:01
  • I submitted an HDInsight feature request here: https://feedback.azure.com/forums/34192--general-feedback/suggestions/14794632-create-a-jdbc-driver-for-spark-on-hdinsight – aaronsteers Jun 13 '16 at 03:44
  • so when you start [beeline](http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server) up that comes with spark this is what the java command looks like `/usr/jdk64/jdk1.7.0_67/bin/java -cp $SPARK_HOME/conf/:$SPARK_HOME/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:$SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar:$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar:$SPARK_HOME/lib/datanucleus-core-3.2.10.jar:/usr/hdp/current/hadoop-client/conf/ -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.hive.beeline.BeeLine` maybe one of these jars has the magic in them – lockwobr Jun 13 '16 at 17:58

1 Answer


the Hive JDBC driver does not appear to support the more advanced SQL features that Spark does

Regardless of which SQL features Spark supports, the Spark Thrift Server speaks the same wire protocol as HiveServer2, so it is fully compatible with Hive/Beeline's JDBC connection.

Therefore, that is the JAR you need to use. I have verified this works in DBVisualizer.
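As a minimal sketch, a plain Java client can talk to the Spark Thrift Server through the standard `java.sql` API with the Hive JDBC driver (`hive-jdbc` standalone JAR) on the classpath. The host name, port 10000 (the Thrift server's default), user, and database below are assumptions for illustration; on HDInsight the Thrift endpoint is typically fronted over HTTPS, so the exact URL parameters will differ:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SparkThriftClient {

    // Builds a HiveServer2-style JDBC URL; the Spark Thrift Server
    // accepts the same jdbc:hive2:// scheme as HiveServer2.
    static String buildUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical cluster host; requires a reachable Thrift server
        // and hive-jdbc on the classpath to actually connect.
        String url = buildUrl("my-cluster.example.com", 10000, "default");
        try (Connection conn = DriverManager.getConnection(url, "spark", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1));
            }
        }
    }
}
```

The same `jdbc:hive2://` URL is what you would paste into SQuirreL, DBVisualizer, or any other JDBC-based SQL client after registering the Hive driver JAR.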

The alternative solution would be to run Spark code in your Java clients (non-third party tools) directly and skip the need for the JDBC connection.
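A rough sketch of that alternative, assuming a `spark-sql` dependency and a local master for illustration (in practice you would point the builder at your cluster), embeds a `SparkSession` directly in the Java application. The query deliberately uses a non-equi join of the kind the question says the Hive driver rejects:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EmbeddedSparkSql {

    // A non-equi join condition: rejected by the Hive JDBC driver,
    // but executed fine by Spark SQL.
    static final String QUERY =
        "SELECT a.id, b.id FROM t1 a JOIN t2 b ON a.id < b.id";

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("embedded-spark-sql")
            .master("local[*]")   // assumption: local mode for illustration
            .getOrCreate();

        // Two tiny in-memory views to join against.
        spark.range(3).createOrReplaceTempView("t1");
        spark.range(3).createOrReplaceTempView("t2");

        Dataset<Row> result = spark.sql(QUERY);
        result.show();

        spark.stop();
    }
}
```

This avoids JDBC entirely, at the cost of pulling the Spark libraries into your application rather than using a lightweight third-party SQL client.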

OneCricketeer
  • How do you run Spark code in your Java clients? How are the queries submitted? – user1870400 Aug 20 '17 at 00:40
  • You just compile and run it... Feel free to post your own question outside the comments to get more in depth answers – OneCricketeer Aug 20 '17 at 02:00
  • I am not sure how one can just compile and run without going through spark-submit? Spark-submit has its own class loader which is not the default java class loader. – user1870400 Aug 29 '17 at 09:35
  • I've setup both IntelliJ and Eclipse for Java/Scala and Hue/Jupyter/Zeppelin for Python/Scala/R. They don't use spark-submit – OneCricketeer Aug 29 '17 at 16:37
  • Are you running your Java code from outside the cluster or inside it, after compilation? I know you can use Livy to connect to Spark as a REST service from outside the cluster; how did you achieve it without a JDBC driver? – sri hari kali charan Tummala Apr 17 '18 at 21:27
  • @sri Well, you can only run code after compilation. JDBC has nothing to do with adding Spark code into an existing JVM application. – OneCricketeer Apr 17 '18 at 22:33
  • @cricket_007, my question is: how do you get data from a Hive table by querying from outside the cluster using Spark on a client (Windows, for example)? You have to use Hive or Spark JDBC drivers, right? Right now we connect to Impala from outside the cluster using the Impala JDBC driver and certificates from the cluster we are connecting to. – sri hari kali charan Tummala Apr 18 '18 at 18:53
  • @sri There is no "Spark" JDBC driver. The Hive JDBC driver connects to the Spark ThriftServer, which is linked to in my question. You can connect Tableau or other BI tools to that, for example. – OneCricketeer Apr 18 '18 at 19:58