
I'm trying to connect to Hadoop/Hive from IntelliJ using keytab-based authentication in Python/PySpark. In Scala I use the statement below to get a Kerberos ticket, but is there a similar way in Python to do Kerberos authentication?

UserGroupInformation.loginUserFromKeytab(principal, keytabPath)
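As suggested in the comments, the equivalent JVM call can be reached from PySpark through the Py4J gateway that Spark exposes as `spark._jvm`. This is only a sketch under that assumption; `login_from_keytab` and its arguments are hypothetical names, and (per the comments below) this logs in the driver-side UGI only, which is too late for Spark's own Hadoop delegation tokens:

```python
def login_from_keytab(spark, principal, keytab_path):
    """Sketch: invoke Hadoop's UserGroupInformation from PySpark via Py4J.

    `spark` is an active SparkSession; `spark._jvm` is Py4J's view of the
    driver JVM, so the fully-qualified Java class is reachable as an
    attribute chain. Assumes the Hadoop jars are on the driver classpath.
    """
    ugi = spark._jvm.org.apache.hadoop.security.UserGroupInformation
    ugi.loginUserFromKeytab(principal, keytab_path)
    # Return the logged-in user's string form for verification.
    return ugi.getLoginUser().toString()
```

Note this only changes the driver's login context after the fact; it does not retroactively give Spark the HDFS/YARN/Hive tokens it fetches at startup.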

Ponns
  • From PySpark, you can use the Py4J gateway to invoke any Java method - the syntax is kind of hacky but it works. Google about that. – Samson Scharfrichter Jun 21 '19 at 11:15
  • hmm, I was able to call a Scala method from Python to renew the Kerberos ticket, but it seems PySpark is not recognizing this token, i.e. I'm running into the error below: `python/src/main/python/com/chase/py4j_test.py 2019-06-29 03:05:47 ERROR TSaslTransport:315 - SASL negotiation failure javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]` Any thoughts on this? – Ponns Jun 29 '19 at 07:10
  • Actually the right question to ask was _"how does Spark manage Kerberos auth from its own distributed computing engine to Hadoop distributed computing platform, although Kerberos is for point-to-point auth and **not** for this kind of mess?"_ – Samson Scharfrichter Jul 01 '19 at 17:46
  • And the answer is: **before starting the Driver**, the Spark "client" connects to the cluster and retrieves auth **tokens** to HDFS/Yarn, optionally to HiveMetastore, optionally to HBase. The tokens are broadcast to Driver and Executors, and valid against any Hadoop node for typically 24h. – Samson Scharfrichter Jul 01 '19 at 17:50
  • So it's too late (and not enough) for you to set up the Hadoop UGI in the driver. Either you create a Kerberos ticket before starting the Spark job, or you pass `principal` and `keytab` as Spark properties, consumed by the "client" – Samson Scharfrichter Jul 01 '19 at 17:54
  • https://stackoverflow.com/q/44265562/5162372 – Samson Scharfrichter Jul 01 '19 at 17:57
  • I'm passing the principal and keytab in Scala code; then below is Python code I'm using to verify that the keytab login is still valid: `gateway = JavaGateway(gateway_parameters=GatewayParameters(port=25335)); stack = gateway.entry_point.getStack(); print("isFromKeytab =", stack.isFromKeytab(), "CurrentUser=", stack.getCurrentUser(), ",isLoginKeytabBased =", stack.isLoginKeytabBased())` The output of the above lines is: `isFromKeytab = True CurrentUser= a_myfid@NAEAST.AD.JPMORGANCHASE.COM (auth:KERBEROS) ,isLoginKeytabBased = True` Then I connect to hdfs/hive. – Ponns Jul 01 '19 at 19:32
  • Read again my comments above. **Spark uses Kerberos at startup only** and does not give a shot at your custom ticket. – Samson Scharfrichter Jul 02 '19 at 08:20
  • The only thing that would use your custom Kerberos ticket is a JDBC connection (say, to MS SQL Server) with Kerberos auth, because the JDBC driver would manage its own auth with core JAAS libraries. – Samson Scharfrichter Jul 02 '19 at 08:21
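Putting the comments together: the fix Samson Scharfrichter describes is to hand the principal and keytab to the Spark "client" *before* the driver starts, so that it can fetch the Hadoop delegation tokens itself. A command-line sketch (paths and principal here are placeholders, not from the original post):

```shell
# Option 1: let Spark handle Kerberos itself at submit time
spark-submit \
  --principal a_myfid@EXAMPLE.COM \
  --keytab /path/to/a_myfid.keytab \
  my_pyspark_job.py

# Option 2: create a ticket cache before launching, then submit normally
kinit -kt /path/to/a_myfid.keytab a_myfid@EXAMPLE.COM
spark-submit my_pyspark_job.py
```

Either way the tokens are fetched by the client and broadcast to the driver and executors, which is why an in-driver UGI login after startup is not enough.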

0 Answers