
I have a simple Java application that can connect to and query my cluster using Hive or Impala, with code like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

...

Class.forName("com.cloudera.hive.jdbc41.HS2Driver");
Connection con = DriverManager.getConnection("jdbc:hive2://myHostIP:10000/mySchemaName;hive.execution.engine=spark;AuthMech=1;KrbRealm=myHostIP;KrbHostFQDN=myHostIP;KrbServiceName=hive");
Statement stmt = con.createStatement();

ResultSet rs = stmt.executeQuery("select * from foobar");
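The connection URL above packs several Kerberos-related properties into one string. As a sketch only, a hypothetical helper (`buildHiveJdbcUrl` is an illustrative name, not part of any driver API) can assemble it from its parts so each property is visible on its own line; property names and values are taken verbatim from the URL in the question:

```java
public class HiveUrlBuilder {
    // Assembles the same JDBC URL used in the question from its components.
    static String buildHiveJdbcUrl(String host, int port, String schema) {
        return "jdbc:hive2://" + host + ":" + port + "/" + schema
                + ";hive.execution.engine=spark"   // run Hive queries on the Spark engine
                + ";AuthMech=1"                    // Cloudera driver: 1 = Kerberos
                + ";KrbRealm=" + host              // the question uses the host IP as the realm
                + ";KrbHostFQDN=" + host
                + ";KrbServiceName=hive";
    }
}
```

Note that using the same host value for `KrbRealm` and `KrbHostFQDN` mirrors the question's URL; in most real clusters the realm is a distinct uppercase name.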

But now I want to run the same query using Spark SQL. I'm having a hard time figuring out how to use the Spark SQL API, though, specifically how to set up the connection. I see examples of how to set up the Spark session, but it's unclear what values I need to provide. For example:

SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate();

How do I tell Spark SQL which host and port to use, which schema to use, and which authentication technique I'm using? For example, I'm using Kerberos to authenticate.

The above Spark SQL code is from https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java

UPDATE:

I was able to make a little progress, and I think I figured out how to tell the Spark SQL connection which host and port to use.

...

SparkSession spark = SparkSession
.builder()
.master("spark://myHostIP:10000")
.appName("Java Spark Hive Example")
.enableHiveSupport()
.getOrCreate();

And I added the following dependency to my pom.xml file:

<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-hive_2.11</artifactId>
   <version>2.0.0</version>
</dependency>
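For `spark-hive` to work, the other Spark artifacts on the classpath need the same Scala suffix and version. A sketch of the matching entries, assuming the same Spark 2.0.0 / Scala 2.11 build as the dependency above:

```xml
<!-- Must match spark-hive_2.11 / 2.0.0 exactly, or classes load inconsistently. -->
<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-core_2.11</artifactId>
   <version>2.0.0</version>
</dependency>
<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-sql_2.11</artifactId>
   <version>2.0.0</version>
</dependency>
```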

With this update I can see that the connection is getting further, but it appears it's now failing because I'm not authenticated. I need to figure out how to authenticate using Kerberos. Here's the relevant log data:

2017-12-19 11:17:55.717  INFO 11912 --- [o-auto-1-exec-1] org.apache.spark.util.Utils              : Successfully started service 'SparkUI' on port 4040.
2017-12-19 11:17:55.717  INFO 11912 --- [o-auto-1-exec-1] org.apache.spark.ui.SparkUI              : Bound SparkUI to 0.0.0.0, and started at http://myHostIP:4040
2017-12-19 11:17:56.065  INFO 11912 --- [er-threadpool-0] s.d.c.StandaloneAppClient$ClientEndpoint : Connecting to master spark://myHostIP:10000...
2017-12-19 11:17:56.260  INFO 11912 --- [pc-connection-0] o.a.s.n.client.TransportClientFactory    : Successfully created connection to myHostIP:10000 after 113 ms (0 ms spent in bootstraps)
2017-12-19 11:17:56.354  WARN 11912 --- [huffle-client-0] o.a.s.n.server.TransportChannelHandler   : Exception in connection from myHostIP:10000

java.io.IOException: An existing connection was forcibly closed by the remote host
Kyle Bridenstine
  • Spark has native integration with the Hive Metastore and with HDFS. In other words, it _should not_ use a JDBC connection to Impala or HS2, because it replaces them. – Samson Scharfrichter Dec 14 '17 at 22:42
  • What does that mean? – Kyle Bridenstine Dec 14 '17 at 22:43
  • From Linux, define the env var HADOOP_CONF_DIR to point to the directory (or directories) containing the conf files for Hadoop and Hive. From Windows... it's way more complicated because of the way Hadoop implements Kerberos auth. – Samson Scharfrichter Dec 14 '17 at 22:48
  • Yeah, sadly I'm on Windows. It seems like Spark SQL isn't that well documented for Java, compared to Hive and Impala. – Kyle Bridenstine Dec 14 '17 at 22:49
  • Read about `HiveContext` -- if you have a Spark build with Hive integration, and the conf files are detected on startup, then it's just as simple as starting a spark-shell and typing `spark.sql("show databases").show`. – Samson Scharfrichter Dec 14 '17 at 22:54
  • For Windows tweaks, find the GitBook by Jacek Laskowski, _"Mastering Apache Spark 2"_, and go straight to "Running Spark apps on Windows". I could do some nit-picking about using HADOOP_HOME and PATH instead of -Dhadoop.home and -Djava.library.path, but his advice just works. – Samson Scharfrichter Dec 14 '17 at 22:58
  • @SamsonScharfrichter I've created a 50-point bounty on this question if you would like to write an answer. I just realized the book you mentioned is free online, so I'm reading it now. – Kyle Bridenstine Dec 18 '17 at 20:11
  • FYI, for anyone having this problem, this SO post is helpful: https://stackoverflow.com/questions/39444493/unable-to-instantiate-sparksession-with-hive-support-because-hive-classes-are-no – Kyle Bridenstine Dec 18 '17 at 21:03
  • FYI, here's another helpful link: https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md – Kyle Bridenstine Dec 18 '17 at 21:33
  • So all @SamsonScharfrichter has to do is wrap his comments into an answer and you'll accept it? Hopefully he comes back to this one and does that. If you're satisfied with it, those bounty points will be a well-deserved award for his work here. – T-Heron Dec 21 '17 at 02:49
  • The answer to your question is the top answer available here: https://stackoverflow.com/questions/31980584/how-to-connect-to-a-hive-metastore-programmatically-in-sparksql – Johan Witters Dec 25 '17 at 08:11
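The HADOOP_CONF_DIR advice in the comments above can be turned into a small pre-flight check, so a missing configuration directory fails with a clear message instead of a late classpath error. A minimal sketch; `describeConfDir` is a hypothetical helper name, and the assumption is that Spark's Hive support reads core-site.xml / hive-site.xml from that directory:

```java
public class ConfDirCheck {
    // Returns a human-readable status for the given HADOOP_CONF_DIR value.
    static String describeConfDir(String value) {
        if (value == null || value.isEmpty()) {
            return "HADOOP_CONF_DIR is not set; Spark will not find the Hadoop/Hive conf files";
        }
        return "Using Hadoop/Hive conf from: " + value;
    }

    public static void main(String[] args) {
        // Check the environment before building the SparkSession.
        System.out.println(describeConfDir(System.getenv("HADOOP_CONF_DIR")));
    }
}
```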

2 Answers


You can try doing the Kerberos login before opening the connection:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.security.UserGroupInformation;

...

Configuration conf = new Configuration();
conf.set("fs.hdfs.impl", DistributedFileSystem.class.getName());
conf.addResource(pathToHdfsSite);
conf.addResource(pathToCoreSite);
conf.set("hadoop.security.authentication", "kerberos");
conf.set("hadoop.rpc.protection", "privacy");
UserGroupInformation.setConfiguration(conf);
UserGroupInformation.loginUserFromKeytab(ktUserName, ktPath);
// your code here

Here, ktUserName is the full principal, like user@TEST.COM, and ktPath is the path to the keytab file. You need to have core-site.xml, hdfs-site.xml, and the keytab on your machine to run this.
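Since the principal format matters here, a tiny sanity check can catch the common mistake of passing a short user name instead of the full user@REALM form. A hypothetical sketch; `looksLikePrincipal` is an illustrative name and not part of the Hadoop API:

```java
public class PrincipalCheck {
    // True if the string has a non-empty user part and a non-empty realm part,
    // i.e. roughly matches the user@REALM shape expected by loginUserFromKeytab.
    static boolean looksLikePrincipal(String ktUserName) {
        int at = ktUserName.indexOf('@');
        return at > 0 && at < ktUserName.length() - 1;
    }
}
```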

Oleg

DataFrame creation using Impala with Kerberos authentication

I was able to make an Impala connection with Kerberos authentication. Check out my git repo here. Maybe this will be of some help.

morfious902002