
I have a Databricks cluster running on some VMs. My organisation has a Hadoop cluster with a bunch of data in it that I want. I have no access to the Hadoop cluster itself, just a JDBC URL (all my permissions have been sorted out; they gave me just a URL).

I can open up a database management tool (DBeaver) on my local machine and query Hive tables successfully.

However, I am struggling to query Hive tables using Databricks and PySpark. It seems that to set the connection string for the HiveContext, I would normally write it in the hive-site.xml file, but Databricks doesn't give me that option.

I am on Hive 2.1.1 and Databricks Runtime 6.4 (includes Apache Spark 2.4.5, Scala 2.11).

Now I am at a loss on how to simply connect to my Hive database.

# The Spark context sc is implicit in Databricks notebooks
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)

# I want to be able to do something like
df = hive_context.sql("SELECT ...")
Alan
  • https://docs.databricks.com/data/metastores/external-hive-metastore.html – thebluephantom Apr 16 '20 at 10:53
  • @thebluephantom I followed this tutorial already and A) it doesn't give any information about the pyspark syntax involved and B) I came across errors because of driver compatibility; this tutorial assumes the hive database is stored in Azure and specifies drivers for that use case. I don't know how to determine suitable drivers. Is that something you know how to fix? – Alan Apr 16 '20 at 11:17
  • Sys admin guys do that. I am actually installing a new system, but with no Hive. In Spark 2 it is just a session parameter and it should be all set up... – thebluephantom Apr 16 '20 at 11:27
  • Cool... hmmm I don't fully understand what you mean there about sys admins. But maybe I'll ask... after playing around I found that the Databricks UI lets me specify session parameters for HiveContext. The parameter in question is spark.hadoop.javax.jdo.option.ConnectionDriverName... this is where the above tutorial specifies SQL drivers. Given I know my ConnectionURL, is it possible to figure out what the ConnectionDriverName should be? – Alan Apr 16 '20 at 11:59
  • I may have misinterpreted. – thebluephantom Apr 16 '20 at 16:10
  • you mean a remote hive cluster relative to databricks which has dbfs as hdfs equivalent? – thebluephantom Apr 16 '20 at 18:10
  • I deleted my answer as I could not get the spark.table option to work. Maybe it is not possible; it looked very convoluted. Did you progress? If so, please post your own answer, as not everyone wants Hive beeline access. – thebluephantom Apr 18 '20 at 10:03
  • Thanks @thebluephantom for taking time with this question, but yes I agree I couldn't get your answer to work either. I have put a bounty on this question. I am working on my own fix too, and I might have it (but it's the weekend now and I want to drink XD), so I'll post my update if I make progress. – Alan Apr 19 '20 at 10:39
  • I know Spark can do cluster to cluster as in that example, but that was at file level, not table level. I am also curious to see how it would work. Also I looked at DB stuff, but hard to assess. I am wondering if spark.table is intra-cluster only. Success. – thebluephantom Apr 19 '20 at 10:45
  • My take is that it is not possible, as no one has answered... – thebluephantom Apr 21 '20 at 18:01

2 Answers


This is the most anti-climactic answer, but all along it was a firewall issue inside my organisation, not the jar files. My organisation permits and blocks access to and from different sources as needed. As embarrassing as this is, please believe me that it was non-obvious from the error!

Databricks comes with many of these jar files preloaded, so this code would have been sufficient to establish a connection all along.

import java.sql.DriverManager
val connection = DriverManager.getConnection("jdbc:hive2://hostname:10000/dbname", "usrname", "psswrd")
connection.isClosed()

I spent a lot of time Googling errors and following clues around various forums, including SO, to no avail. In the end I spoke to a Databricks SME to conclude it was a network issue.
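
For anyone working in PySpark rather than Scala, a rough equivalent of the Scala snippet above, using Spark's generic JDBC reader, might look like the sketch below. The hostname, database, table and credentials are placeholders, and it assumes the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) is among the jars Databricks preloads.

# Sketch only: read a Hive table over JDBC from PySpark.
# All connection details are placeholders; the Hive JDBC driver is assumed
# to be on the cluster classpath.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:hive2://hostname:10000/dbname")
      .option("driver", "org.apache.hive.jdbc.HiveDriver")
      .option("dbtable", "tablename")
      .option("user", "usrname")
      .option("password", "psswrd")
      .load())
df.show()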

Alan
  • So, are you able to do spark.table(...) in the notebook? Can you join over the two? I suspect so, but never tried. Or are you reading it as a JDBC source? – thebluephantom Apr 25 '20 at 11:09

Specifying the configuration hive.metastore.uris during the SparkSession builder call will solve your problem in case you are running from any other distribution, but for Databricks you have to edit the Spark config for your cluster.
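
For instance, outside Databricks a minimal PySpark sketch of that approach might look like the following (the thrift://metastore-host:9083 URI is a placeholder for your metastore endpoint); on Databricks, the assumption is that you would put the equivalent key (e.g. spark.hadoop.hive.metastore.uris) into the cluster's Spark config instead.

from pyspark.sql import SparkSession

# Sketch only: point the Hive client at a remote metastore without a hive-site.xml.
# The thrift URI is a placeholder for your actual metastore host and port.
spark = (SparkSession.builder
         .appName("remote-hive-metastore")
         .config("hive.metastore.uris", "thrift://metastore-host:9083")
         .enableHiveSupport()
         .getOrCreate())

df = spark.sql("SELECT * FROM dbname.tablename LIMIT 10")
df.show()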

You can refer to the answers below:

How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?

How to add Spark configuration in Databricks cluster

avikm