We recently enabled Kerberos authentication on our Spark cluster, and since then Spark jobs submitted in cluster mode cannot connect to Hive. Should we be using Kerberos to authenticate to Hive, and if so, how? As detailed below, I think we have to specify a keytab and a principal, but I don't know which ones.
This is the exception we get:
Traceback (most recent call last):
File "/mnt/resource/hadoop/yarn/local/usercache/sa-etl/appcache/application_1649255698304_0003/container_e01_1649255698304_0003_01_000001/__pyfiles__/utils.py", line 222, in use_db
spark.sql("CREATE DATABASE IF NOT EXISTS `{db}`".format(db=db))
File "/usr/hdp/current/spark3-client/python/pyspark/sql/session.py", line 723, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/hdp/current/spark3-client/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/usr/hdp/current/spark3-client/python/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: java.lang.RuntimeException: java.io.IOException: DestHost:destPort hn1-pt-dev.MYREALM:8020 , LocalHost:localPort wn1-pt-dev/10.208.3.12:0. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
Additionally, I saw this exception:
org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS], while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over hn0-pt-dev.myrealm/10.208.3.15:8020
This is the script that produces the exception, which, as you can see, occurs on the `CREATE DATABASE` statement:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').enableHiveSupport().getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS TestDb")
Environment and relevant information
We have an ESP-enabled HDInsight cluster in Azure, inside a virtual network. AADDS works fine for logging into the cluster. The cluster is connected to a Storage Account, which it accesses over ABFS and which holds the Hive warehouse. We are using YARN. We want to execute Spark jobs using PySpark from Azure Data Factory, which uses Livy, but if we can get it to work with the spark-submit CLI it will hopefully also work with Livy. We are using Spark 3.1.1 and Kerberos 1.10.3-30.
The exception only occurs when we use `spark-submit --deploy-mode cluster`; in client mode there is no exception and the database is created.
When we remove the `.enableHiveSupport()` call, the exception also disappears, so it apparently has something to do with the authentication to Hive.
We do need the Hive warehouse though, because we need to access tables from within multiple Spark sessions so they need to be persisted.
We can access HDFS, also in cluster mode: `sc.textFile('/example/data/fruits.txt').collect()` works fine.
Similar questions and possible solutions
In the exception, I see that it is the worker node that tries to access the head node. The port is 8020, which I think is the NameNode port, so this does sound HDFS-related, except that to my understanding we can access HDFS but not Hive.
- https://spark.apache.org/docs/latest/running-on-yarn.html#kerberos It suggests specifying principal and keytab file explicitly, so I found the keytab file with `klist -k` and added `--principal myusername@MYREALM --keytab /etc/krb5.keytab` to the spark-submit command line. This is the same keytab file as in one of the linked questions below. However, I got
Exception in thread "main" org.apache.hadoop.security.KerberosAuthException: failure to login: for principal: myusername@MYREALM from keytab /etc/krb5.keytab javax.security.auth.login.LoginException: Unable to obtain password from user
Maybe I have the wrong keytab file though, because when I run `klist -k /etc/krb5.keytab` I only get slots with entries like `HN0-PT-DEV@MYREALM` and `host/hn0-pt-dev.myrealm@MYREALM`.
If I look in the keytabs for hdfs/hive in `/etc/security/keytabs`, I also see only entries for the hdfs/hive users.
When I try adding all the extraJavaOptions specified in "How to use Apache Spark to query Hive table with Kerberos?" but don't specify principal/keytab, I get `KrbException: Cannot locate default realm`, even though the default realm in `/etc/krb5.conf` is correct.
In Ambari, I can see the settings `spark.yarn.keytab={{hive_kerberos_keytab}}` and `spark.yarn.principal={{hive_kerberos_principal}}`.
- https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-faq#how-do-i-create-a-keytab-for-an-hdinsight-esp-cluster- I created a keytab for my user and specified that file instead, but that didn't help.
It appears that many other answers and websites also suggest specifying principal/keytab explicitly:
- Spark on YARN + Secured hbase For HBase instead of Hive, but same conclusion.
- https://www.ibm.com/docs/en/spectrum-conductor/2.4.1?topic=ssbaig-submitting-spark-batch-applications-kerberos-enabled-hdfs-keytab
- Issue with Spark Java API, Kerberos, and Hive
- spark-submit failing to connect to metastore due to Kerberos : Caused by GSSException: No valid credentials provided . but works in local-client mode
- https://docs.cloudera.com/documentation/enterprise/5-7-x/topics/sg_spark_auth.html#concept_bvc_pcy_dt (I couldn't find similar documentation from Microsoft)
- spark-submit,Client cannot authenticate via:[TOKEN, KERBEROS];
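The suggestion these links share, passing a principal and a keytab at submit time, might look like the sketch below. It only assembles the command line; the script name and keytab path are hypothetical placeholders, and the keytab is assumed to contain an entry for exactly the principal given. The `/etc/krb5.keytab` file described above only lists host entries, which is consistent with the "Unable to obtain password from user" failure.

```python
import shlex


def build_submit_command(script, principal, keytab, deploy_mode="cluster"):
    """Return the spark-submit argv for a Kerberos-enabled YARN cluster."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        # The keytab must hold a key for this exact principal.
        "--principal", principal,
        "--keytab", keytab,
        script,
    ]


# Hypothetical values: a per-user keytab, not the machine keytab.
command = build_submit_command(
    script="test.py",
    principal="myusername@MYREALM",
    keytab="/home/myusername/myusername.keytab",
)
print(" ".join(shlex.quote(part) for part in command))
```

Building the argument list in code makes it easy to switch between client and cluster mode while keeping the Kerberos flags identical, which helps isolate the deploy mode as the only variable.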
Other questions:
- https://spark.apache.org/docs/2.1.1/running-on-yarn.html#running-in-a-secure-cluster To start with the official documentation: it explains that
For a Spark application to interact with HDFS, HBase and Hive, it must acquire the relevant tokens using the Kerberos credentials of the user launching the application —that is, the principal whose identity will become that of the launched Spark application. This is normally done at launch time: in a secure cluster Spark will automatically obtain a token for the cluster’s HDFS filesystem, and potentially for HBase and Hive.
Well, the user launching the application has a valid ticket, as can be seen in the output of `klist`. The user has contributor access to the blob storage (not sure if that is actually needed). I don't understand what is meant by "Spark will automatically obtain a token for Hive [at launch time]" though. I did restart all services on the cluster, but that didn't help.
- Kerberos authentication with Hadoop cluster from Spark stand alone cluster running on Kubernetes cluster This is a situation with two clusters. As explained here:
in yarn-cluster mode, the Spark client uses the local Kerberos ticket to connect to Hadoop services and retrieve special auth tokens that are then shipped to the YARN container running the driver; then the driver broadcasts the token to the executors
- When running Spark on Kubernetes to access kerberized Hadoop cluster, how do you resolve a "SIMPLE authentication is not enabled" error on executors? For older Spark version.
- Cannot connect to HIVE with Secured kerberos. I am using UserGroupInformation.loginUserFromKeytab() Something about JAAS
- Spark-submit job fails on yarn nodemanager with error Client cannot authenticate via:[TOKEN, KERBEROS] No answer
- Client cannot authenticate via: [TOKEN, KERBEROS) Not making sense to me.
- Hive is not accessible via Spark In Kerberos Environment : Client cannot authenticate via:[TOKEN, KERBEROS] Added spark.security.credentials.hadoopfs.enabled=true
- https://funclojure.tumblr.com/post/155129283948/hdfs-kerberos-java-client-api-pains about jars
- org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] Issue No answer
- https://issues.apache.org/jira/browse/SPARK-27554 No answer
- java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] old
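Since several of these links talk about delegation tokens being obtained at launch and shipped to the driver, one way to see what actually arrived might be to list the tokens held by the driver's Hadoop user. This is a rough sketch, not a verified diagnosis: the function name is hypothetical, and it assumes it is called with the py4j JVM view of a running driver. It uses Hadoop's `UserGroupInformation.getCurrentUser()`, `getCredentials()` and `getAllTokens()`.

```python
def list_delegation_tokens(jvm):
    """Return (kind, service) pairs for tokens held by the current Hadoop user.

    `jvm` is assumed to be the py4j JVM view of a running Spark driver.
    """
    ugi = jvm.org.apache.hadoop.security.UserGroupInformation.getCurrentUser()
    tokens = []
    # getAllTokens() returns a plain java.util.Collection, so walk it
    # through its Java iterator instead of relying on py4j conversion.
    iterator = ugi.getCredentials().getAllTokens().iterator()
    while iterator.hasNext():
        token = iterator.next()
        tokens.append((token.getKind().toString(), token.getService().toString()))
    return tokens
```

In the PySpark script this could be called as `list_delegation_tokens(spark.sparkContext._jvm)` right after the session is created (note that `_jvm` is a private attribute). If my understanding is right, an HDFS token would show up with kind `HDFS_DELEGATION_TOKEN` and a Hive metastore token with kind `HIVE_DELEGATION_TOKEN`; comparing the output in client and cluster mode should show which token is missing.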
Possible things to try:
- https://spark.apache.org/docs/2.1.1/running-on-yarn.html#troubleshooting-kerberos Enable more detailed logging.
- https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-linux-ambari-ssh-tunnel Viewing the Namenode UI might give some information
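For the first item, the Spark troubleshooting page mentions the `HADOOP_JAAS_DEBUG` environment variable and the `sun.security.krb5.debug` and `sun.security.spnego.debug` JVM properties. A minimal sketch of turning these into `--conf` arguments for a cluster-mode submit follows; the helper names are my own, and I am assuming that in cluster mode the driver options have to go through `spark.driver.extraJavaOptions` and `spark.yarn.appMasterEnv.*`.

```python
KRB_DEBUG_JAVA_OPTS = (
    "-Dsun.security.krb5.debug=true -Dsun.security.spnego.debug=true"
)


def kerberos_debug_conf():
    """Spark settings that switch on Kerberos/JAAS debug output."""
    return {
        # JVM-level Kerberos and SPNEGO tracing for driver and executors.
        "spark.driver.extraJavaOptions": KRB_DEBUG_JAVA_OPTS,
        "spark.executor.extraJavaOptions": KRB_DEBUG_JAVA_OPTS,
        # Hadoop JAAS debugging in the application master and executors.
        "spark.yarn.appMasterEnv.HADOOP_JAAS_DEBUG": "true",
        "spark.executorEnv.HADOOP_JAAS_DEBUG": "true",
    }


def as_submit_args(conf):
    """Turn a settings dict into repeated `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


debug_args = as_submit_args(kerberos_debug_conf())
```

These values replace any `extraJavaOptions` already set on the command line, so existing options would need to be merged into the same string. The extra output should end up in the YARN container logs of the driver and executors.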
Updates
When logged in as the hive user, I ran `kinit` and then supplied the `hive` password:
Password for hive/hn0-pt-dev.myrealm@MYREALM:
kinit: Password incorrect while getting initial credentials
hive@hn0-pt-dev:/tmp$ klist -k /etc/security/keytabs/hive.service.keytab
Keytab name: FILE:/etc/security/keytabs/hive.service.keytab
KVNO Principal
---- --------------------------------------------------------------------------
0 hive/hn0-pt-dev.myrealm@MYREALM
0 hive/hn0-pt-dev.myrealm@MYREALM
0 hive/hn0-pt-dev.myrealm@MYREALM
0 hive/hn0-pt-dev.myrealm@MYREALM
0 hive/hn0-pt-dev.myrealm@MYREALM
hive@hn0-pt-dev:/tmp$ kinit -k /etc/security/keytabs/hive.service.keytab
kinit: Client '/etc/security/keytabs/hive.service.keytab@MYREALM' not found in Kerberos database while getting initial credentials
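The last error suggests that `kinit -k` treated the keytab path as the principal name. With MIT Kerberos, as far as I know, the keytab is passed with `-t` and the principal is a separate positional argument. The sketch below parses plain `klist -k` output to find which principals a keytab holds and builds the matching `kinit` command; the helper names are hypothetical and the sample is a shortened copy of the listing above.

```python
def principals_in_keytab(klist_output):
    """Extract the distinct principals from plain `klist -k` output, in order."""
    principals = []
    for line in klist_output.splitlines():
        parts = line.split()
        # Entry rows look like "<KVNO> <principal>"; headers and the
        # dashed ruler do not start with a number.
        if len(parts) == 2 and parts[0].isdigit() and "@" in parts[1]:
            if parts[1] not in principals:
                principals.append(parts[1])
    return principals


def kinit_command(keytab, principal):
    """Build a kinit call that logs in from a keytab without a password."""
    return ["kinit", "-k", "-t", keytab, principal]


SAMPLE = """\
Keytab name: FILE:/etc/security/keytabs/hive.service.keytab
KVNO Principal
---- --------------------------------------------------------------------------
0 hive/hn0-pt-dev.myrealm@MYREALM
0 hive/hn0-pt-dev.myrealm@MYREALM
"""

found = principals_in_keytab(SAMPLE)
command = kinit_command("/etc/security/keytabs/hive.service.keytab", found[0])
```

The same check can be applied to any keytab before passing it to `--keytab`: if the principal given to `--principal` is not in the list, the login cannot succeed.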