
I'm trying to connect to the SparkSQL thriftserver (Spark 1.6.2) via Knox in a cluster secured with Kerberos (the Hadoop distribution is HDP 2.4.2). We have the same architecture for Hive and it works fine. Since Spark uses the same thriftserver implementation, I thought doing the same thing would be trivial, but it turns out it isn't.
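For reference, we connect to Hive through Knox with something along these lines (hostnames, truststore path, and topology path are placeholders for our actual values):

    beeline -u "jdbc:hive2://knox-host.example.com:8443/;ssl=true;sslTrustStore=/path/to/gateway.jks;trustStorePassword=changeit;transportMode=http;httpPath=gateway/default/hive" -n myuser -p mypassword

I expected that pointing the same kind of URL at a topology mapped to the SparkSQL thriftserver would just work.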

The error thrown by the Spark thriftserver when connecting via Knox is:

16/10/17 15:25:39 ERROR ThriftHttpServlet: Failed to authenticate with hive/_HOST kerberos principal
16/10/17 15:25:39 ERROR ThriftHttpServlet: Error: 
org.apache.hive.service.auth.HttpAuthenticationException: java.lang.reflect.UndeclaredThrowableException
at org.apache.hive.service.cli.thrift.ThriftHttpServlet.doKerberosAuth(ThriftHttpServlet.java:361)
at org.apache.hive.service.cli.thrift.ThriftHttpServlet.doPost(ThriftHttpServlet.java:136)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:755)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
at org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
at org.spark-project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
at org.spark-project.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:229)
at org.spark-project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
at org.spark-project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
at org.spark-project.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at org.spark-project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
at org.spark-project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.spark-project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.spark-project.jetty.server.Server.handle(Server.java:366)
at org.spark-project.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at org.spark-project.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
at org.spark-project.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
at org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:957)
at org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at org.spark-project.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at org.spark-project.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
at org.spark-project.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1727)
at org.apache.hive.service.cli.thrift.ThriftHttpServlet.doKerberosAuth(ThriftHttpServlet.java:358)
... 24 more
Caused by: org.apache.hive.service.auth.HttpAuthenticationException: Authorization header received from the client is empty.
at org.apache.hive.service.cli.thrift.ThriftHttpServlet.getAuthHeader(ThriftHttpServlet.java:502)
at org.apache.hive.service.cli.thrift.ThriftHttpServlet.access$100(ThriftHttpServlet.java:68)
at org.apache.hive.service.cli.thrift.ThriftHttpServlet$HttpKerberosServerAction.run(ThriftHttpServlet.java:403)
at org.apache.hive.service.cli.thrift.ThriftHttpServlet$HttpKerberosServerAction.run(ThriftHttpServlet.java:366)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
... 25 more

Does anybody have any idea what is going on here and how to fix it?

Thank you, Marco


1 Answer


As with HiveServer2, the empty client Authorization header may actually be a red herring. The first HTTP request doesn't carry the header; it is generally sent only after the server responds with a SPNEGO challenge.
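You can observe that two-step exchange with curl against a thriftserver in HTTP mode, given a valid Kerberos ticket from kinit (host, port, and path below are placeholders; /cliservice is the usual default for hive.server2.thrift.http.path):

    # The first request goes out with no Authorization header; the server
    # replies 401 with "WWW-Authenticate: Negotiate", and curl then retries
    # with "Authorization: Negotiate <SPNEGO token>".
    curl -i --negotiate -u : "http://thriftserver-host.example.com:10001/cliservice"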

I wasn't actually aware that the SparkSQL thrift server could be used in the same way that Hive can be. Do you know whether it has Trusted Proxy support, as implemented in many Hadoop services? This is what allows a third-party component such as Apache Knox to act on behalf of another user by asserting the authenticated user's name via the doAs query parameter. It also assures that the doAs request is coming from an identity the service trusts; in this case, that trust is established via Kerberos/SPNEGO authentication.
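For Hive, Trusted Proxy support is wired up with the standard Hadoop impersonation properties. A minimal sketch, assuming "knox" is the short name of the gateway's service principal and using placeholder values:

    <!-- core-site.xml: allow the knox user to act on behalf of end users -->
    <property>
      <name>hadoop.proxyuser.knox.groups</name>
      <value>users</value>
    </property>
    <property>
      <name>hadoop.proxyuser.knox.hosts</name>
      <value>knox-host.example.com</value>
    </property>

Whether the SparkSQL thrift server honors these settings is exactly the open question.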

If it doesn't have support for Trusted Proxies, then it will not work straight out of the box. Either Trusted Proxy support would need to be added to the SparkSQL thrift server, or a custom dispatch provider would need to be created for SparkSQL in Knox. The custom dispatch would allow us to propagate the user identity in the form that SparkSQL expects.
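As a point of comparison, this is roughly what the working Hive entry looks like in a Knox topology file (the URL is a placeholder); a SPARKSQL role would need its own service definition and dispatch in Knox before an analogous entry could work:

    <service>
      <role>HIVE</role>
      <url>http://hiveserver-host.example.com:10001/cliservice</url>
    </service>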

Hope that is helpful.

--larry
