I could not find anything on this after hours of searching on Google, so I hope to get some ideas for my problem here.

I am trying to get data from a remote hive cluster using spark2. I have followed:

  1. How to connect to a Hive metastore programmatically in SparkSQL?
  2. How to connect to remote hive server from spark

And I was able to connect to the remote hive metastore successfully.

However, my problem starts when I execute a query against the remote Hive, e.g. spark.sql("select count(*) from table"). I get an "unknown host: ns-bigdata" error, where ns-bigdata is the cluster name of the remote cluster.

What else am I missing here? Do I also need to specify hive.metastore.warehouse.dir, e.g. hdfs://local-cluster:8020/user/hive/warehouse?

Thanks in advance.

Kok-Lim Wong
  • Sounds like your DNS server is not working. Try using IP addresses – OneCricketeer Jan 10 '20 at 14:54
  • I don't think it's the DNS, as my Spark session is able to connect to the remote Hive metastore with the hostname, i.e. spark.config("spark.hadoop.hive.metastore.uri", "thrift://remote.hive.domain:9083"). – Kok-Lim Wong Jan 10 '20 at 15:00
  • That's just a string. The connection is not attempted until you actually run a query – OneCricketeer Jan 10 '20 at 15:06
  • Try running a simpler query, spark.sql("show databases").show(), to make sure the connection is fine. If that works, include the database name in the query as well: spark.sql("select count(*) from database.table"). Also, to be clear: the machine where you are running spark2-submit or spark2-shell is not part of the cluster "ns-bigdata"? – yammanuruarun Jan 11 '20 at 13:55
  • After some thought, I think @cricket_007 may be right. I think that when I run a query, Hive tries to access the warehouse directory in HDFS to check the schema but cannot find it, because my Spark cluster doesn't know where ns-bigdata is. I'll see if I can get the IP of ns-bigdata and put it in the hosts file of my cluster. – Kok-Lim Wong Jan 15 '20 at 14:09
  • Found out that the PIC did not configure cross-realm authentication in Kerberos, which is why we can't connect to the Hive Thrift server with Spark. – Kok-Lim Wong Jul 14 '20 at 15:35
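
On the name-resolution point raised in the comments above, one quick check is to ask the local resolver directly from the machine that runs the Spark driver. This is only a minimal sketch using Python's standard library; `can_resolve` is a hypothetical helper name, not something from Spark or Hive.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Return True if the local resolver can map hostname to an address."""
    try:
        # getaddrinfo consults the hosts file and DNS, like most clients do.
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Raised when the name cannot be resolved.
        return False
```

Calling `can_resolve("remote.hive.domain")` and `can_resolve("ns-bigdata")` on the driver host shows whether each name from the question is known there. If ns-bigdata turns out to be a logical HDFS nameservice rather than a real host, a failed lookup is expected, and the client-side HDFS configuration is the place to look instead of DNS.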

2 Answers

The Hive server URL is in hive-site.xml. Can you try using that? Also check whether hive-site.xml is present in the conf/ directory of Spark.
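
As a rough sketch of that check, the snippet below pulls a property value out of hive-site.xml content with Python's standard library. The helper name `hive_site_property` and the sample XML are assumptions for illustration; the sample reuses the metastore address from the question, and `hive.metastore.uris` is the standard Hive property for it.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def hive_site_property(xml_text: str, name: str) -> Optional[str]:
    """Return the value of a named property from hive-site.xml content, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Hypothetical sample content; a real hive-site.xml will have many more properties.
SAMPLE = """<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://remote.hive.domain:9083</value>
  </property>
</configuration>"""

print(hive_site_property(SAMPLE, "hive.metastore.uris"))
# prints: thrift://remote.hive.domain:9083
```

For a real file, read it from Spark's conf/ directory (for example with `ET.parse(path).getroot()` in place of `ET.fromstring`) and compare the value with what the Spark session is configured to use.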

Yayati Sule

The real reason was that the customer had not set up their Kerberos cert on the Hive Thrift server for cross-realm authentication. We ended up using Impala over JDBC.

Kok-Lim Wong