
We have two Cloudera 5.7.1 clusters, one secured using Kerberos and one unsecured.

Is it possible to run Spark using the unsecured YARN cluster while accessing hive tables stored in the secured cluster? (Spark version is 1.6)

If so, can you please provide some explanation on how can I get it configured?

Update:

I want to explain a little the end goal behind my question. Our main secured cluster is heavily utilized and our jobs can't get enough resources to complete in a reasonable time. In order to overcome this, we wanted to use resources from another unsecured cluster we have without needing to copy the data between the clusters.

We know it's not the best solution, as the data locality level might not be optimal; however, that's the best solution we can come up with for now.

Please let me know if you have any other solution as it seems like we can't achieve the above.

Koby
  • On second thoughts, you can run your Spark driver against the remote, secure cluster... and download the results on your local machine. But then you would need another job to upload these results to the unsecure HDFS. – Samson Scharfrichter Mar 07 '17 at 20:31
  • It's not that good for our use case. Please see my updated question. – Koby Mar 08 '17 at 08:10
  • Using an unsecure cluster to process secure data, on a day-to-day basis? That defeats the purpose of securing data! It is that simple: either you admit that you can't afford security, or you add resources to your secure cluster. Could be compute-only nodes, just for YARN, with a small disk capacity. – Samson Scharfrichter Mar 08 '17 at 10:31
  • In other words, you could "cannibalize" some nodes from your unsecure cluster, i.e. decommission them *(drain HDFS blocks to the remaining nodes, stop YARN & HDFS services, blacklist the nodes in RM & NN config)* then add them to the secure cluster, only for YARN, with the proper Kerberos config. Now you have extra computing power, with no locality in data access, but if your network bandwidth is adequate it should not be too bad. – Samson Scharfrichter Mar 08 '17 at 10:39

1 Answer


If you run Spark in local mode, you can make it use an arbitrary set of Hadoop conf files -- i.e. core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, hive-site.xml copied from the Kerberized cluster.
So you can access HDFS on that cluster -- if you have a Kerberos ticket that grants you access to that cluster, of course.

  # point the Hadoop client libraries at the remote cluster's conf
  export HADOOP_CONF_DIR=/path/to/conf/of/remote/kerberized/cluster
  # get a Kerberos TGT for a principal that has access to that cluster
  kinit sylvestre@WORLD.COMPANY
  # local mode: no YARN containers involved, only remote HDFS/Hive access
  spark-shell --master local[*]

But in yarn-client or yarn-cluster mode, you cannot launch containers in the local cluster and access HDFS in the other.

  • either you use the local core-site.xml that says that hadoop.security.authentication is simple, and you can connect to local YARN/HDFS
  • or you point to a copy of the remote core-site.xml that says that hadoop.security.authentication is kerberos, and you can connect to remote YARN/HDFS
  • but you cannot use the local, unsecure YARN and access the remote, secure HDFS
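
As a sketch of the first two options -- the paths, hostnames and the `yarn-client` invocation are assumptions based on a stock CDH layout, adjust to yours:

  # option 1: local, unsecure cluster (its core-site.xml declares "simple" auth)
  export HADOOP_CONF_DIR=/etc/hadoop/conf
  spark-shell --master yarn-client

  # option 2: remote, Kerberized cluster (its core-site.xml declares "kerberos" auth)
  export HADOOP_CONF_DIR=/path/to/conf/of/remote/kerberized/cluster
  kinit sylvestre@WORLD.COMPANY
  spark-shell --master yarn-client

  # quick check of which mode a given conf dir declares
  grep -A1 'hadoop.security.authentication' $HADOOP_CONF_DIR/core-site.xml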

Note that with unsecure-unsecure or secure-secure combinations, you could access HDFS in another cluster by hacking your own custom hdfs-site.xml to define multiple namespaces. But you are stuck with a single authentication model.
[edit] see the comment by Mighty Steve Loughran about an extra Spark property to access remote, secure HDFS from a local, secure cluster.
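
For that secure-to-secure case, the property would be passed roughly like this -- the NameNode URI is a made-up placeholder:

  # only useful when both clusters are Kerberized; Spark skips the token request otherwise
  spark-shell --master yarn-client \
    --conf spark.yarn.access.namenodes=hdfs://remote-nn.example.com:8020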

Note also that with DistCp you are stuck the same way -- except that there's a "cheat" property that allows you to go from secure to unsecure.
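
That "cheat" is the property named in the comments below; a rough sketch, with made-up NameNode hosts, run from the secure side with a valid Kerberos ticket:

  # push data from the Kerberized cluster to the unsecure one; the fallback flag
  # lets the client drop to simple auth when it talks to the unsecure NameNode
  hadoop distcp -D ipc.client.fallback-to-simple-auth-allowed=true \
    hdfs://secure-nn.example.com:8020/data/source \
    hdfs://unsecure-nn.example.com:8020/data/target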

Samson Scharfrichter
  • Thanks for your answer. It sounds reasonable, as it seems like it's the error we're getting. Do you happen to know if we can achieve it by maybe switching to Spark's standalone cluster? – Koby Mar 07 '17 at 15:09
  • Don't know -- but I'm afraid that, if you start a standalone cluster, then Spark will not care about Hadoop cluster configuration options. And vice-versa. – Samson Scharfrichter Mar 07 '17 at 15:14
  • Basically, Spark authenticates against YARN to *(a)* get a Hadoop delegation token *(valid against YARN and HDFS for 7 hours, no need to get back to Kerberos, except for long-running jobs such as Streaming)* and *(b)* request containers for its executors, if needed. It never authenticates directly against HDFS. – Samson Scharfrichter Mar 07 '17 at 15:21
  • If you set `spark.yarn.access.namenodes` to a list of hdfs clusters, spark will ask for an HDFS delegation token for all the remote filesystems. This will only be used if kerberos is turned on and the user logged in. Otherwise spark looks at cluster config, sees "no kerberos" and skips it. – stevel Mar 09 '17 at 13:38
  • That cheat property is 'ipc.client.fallback-to-simple-auth-allowed'; it should be backported to Spark as well – tribbloid Jan 12 '18 at 16:20
  • @tribbloid is this `ipc.client.fallback-to-simple-auth-allowed` backported to Spark also? – Sangram Gaikwad Oct 03 '18 at 10:55