Missing hive-site when using spark-submit YARN cluster mode

Question

Using HDP 2.5.3 and I've been trying to debug some YARN container classpath issues.

Since HDP includes both Spark 1.6 and 2.0.0, there have been some conflicting versions

Users I support are successfully able to use Spark2 with Hive queries in YARN client mode, but not from cluster mode they get errors about tables not found, or something like that because the Metastore connection isn't established.

I am guessing that setting either --driver-class-path /etc/spark2/conf:/etc/hive/conf or passing --files /etc/spark2/conf/hive-site.xml after spark-submit would work, but why isn't hive-site.xml loaded already from the conf folder?

Accoringing to Hortonworks docs, says hive-site should be placed in $SPARK_HOME/conf, and it is...

I see hdfs-site.xml and core-site.xml, and other files that are part of HADOOP_CONF_DIR, for example, and this is the from the YARN UI container info.

2232355    4 drwx------   2 yarn     hadoop       4096 Aug  2 21:59 ./__spark_conf__
2232379    4 -r-x------   1 yarn     hadoop       2358 Aug  2 21:59 ./__spark_conf__/topology_script.py
2232381    8 -r-x------   1 yarn     hadoop       4676 Aug  2 21:59 ./__spark_conf__/yarn-env.sh
2232392    4 -r-x------   1 yarn     hadoop        569 Aug  2 21:59 ./__spark_conf__/topology_mappings.data
2232398    4 -r-x------   1 yarn     hadoop        945 Aug  2 21:59 ./__spark_conf__/taskcontroller.cfg
2232356    4 -r-x------   1 yarn     hadoop        620 Aug  2 21:59 ./__spark_conf__/log4j.properties
2232382   12 -r-x------   1 yarn     hadoop       8960 Aug  2 21:59 ./__spark_conf__/hdfs-site.xml
2232371    4 -r-x------   1 yarn     hadoop       2090 Aug  2 21:59 ./__spark_conf__/hadoop-metrics2.properties
2232387    4 -r-x------   1 yarn     hadoop        662 Aug  2 21:59 ./__spark_conf__/mapred-env.sh
2232390    4 -r-x------   1 yarn     hadoop       1308 Aug  2 21:59 ./__spark_conf__/hadoop-policy.xml
2232399    4 -r-x------   1 yarn     hadoop       1480 Aug  2 21:59 ./__spark_conf__/__spark_conf__.properties
2232389    4 -r-x------   1 yarn     hadoop       1602 Aug  2 21:59 ./__spark_conf__/health_check
2232385    4 -r-x------   1 yarn     hadoop        913 Aug  2 21:59 ./__spark_conf__/rack_topology.data
2232377    4 -r-x------   1 yarn     hadoop       1484 Aug  2 21:59 ./__spark_conf__/ranger-hdfs-audit.xml
2232383    4 -r-x------   1 yarn     hadoop       1020 Aug  2 21:59 ./__spark_conf__/commons-logging.properties
2232357    8 -r-x------   1 yarn     hadoop       5721 Aug  2 21:59 ./__spark_conf__/hadoop-env.sh
2232391    4 -r-x------   1 yarn     hadoop        281 Aug  2 21:59 ./__spark_conf__/slaves
2232373    8 -r-x------   1 yarn     hadoop       6407 Aug  2 21:59 ./__spark_conf__/core-site.xml
2232393    4 -r-x------   1 yarn     hadoop        812 Aug  2 21:59 ./__spark_conf__/rack-topology.sh
2232394    4 -r-x------   1 yarn     hadoop       1044 Aug  2 21:59 ./__spark_conf__/ranger-hdfs-security.xml
2232395    8 -r-x------   1 yarn     hadoop       4956 Aug  2 21:59 ./__spark_conf__/metrics.properties
2232386    8 -r-x------   1 yarn     hadoop       4221 Aug  2 21:59 ./__spark_conf__/task-log4j.properties
2232380    4 -r-x------   1 yarn     hadoop         64 Aug  2 21:59 ./__spark_conf__/ranger-security.xml
2232372   20 -r-x------   1 yarn     hadoop      19975 Aug  2 21:59 ./__spark_conf__/yarn-site.xml
2232397    4 -r-x------   1 yarn     hadoop       1006 Aug  2 21:59 ./__spark_conf__/ranger-policymgr-ssl.xml
2232374    4 -r-x------   1 yarn     hadoop         29 Aug  2 21:59 ./__spark_conf__/yarn.exclude
2232384    4 -r-x------   1 yarn     hadoop       1606 Aug  2 21:59 ./__spark_conf__/container-executor.cfg
2232396    4 -r-x------   1 yarn     hadoop       1000 Aug  2 21:59 ./__spark_conf__/ssl-server.xml
2232375    4 -r-x------   1 yarn     hadoop          1 Aug  2 21:59 ./__spark_conf__/dfs.exclude
2232359    8 -r-x------   1 yarn     hadoop       7660 Aug  2 21:59 ./__spark_conf__/mapred-site.xml
2232378   16 -r-x------   1 yarn     hadoop      14474 Aug  2 21:59 ./__spark_conf__/capacity-scheduler.xml
2232376    4 -r-x------   1 yarn     hadoop        884 Aug  2 21:59 ./__spark_conf__/ssl-client.xml

As you might see, hive-site is not there, even though I definitely have conf/hive-site.xml for spark-submit to take

[spark@asthad006 conf]$ pwd && ls -l
/usr/hdp/2.5.3.0-37/spark2/conf
total 32
-rw-r--r-- 1 spark spark   742 Mar  6 15:20 hive-site.xml
-rw-r--r-- 1 spark spark   620 Mar  6 15:20 log4j.properties
-rw-r--r-- 1 spark spark  4956 Mar  6 15:20 metrics.properties
-rw-r--r-- 1 spark spark   824 Aug  2 22:24 spark-defaults.conf
-rw-r--r-- 1 spark spark  1820 Aug  2 22:24 spark-env.sh
-rwxr-xr-x 1 spark spark   244 Mar  6 15:20 spark-thrift-fairscheduler.xml
-rw-r--r-- 1 hive  hadoop  918 Aug  2 22:24 spark-thrift-sparkconf.conf

So, I don't think I am supposed to place hive-site in HADOOP_CONF_DIR as HIVE_CONF_DIR is separated, but my question is that how do we get Spark2 to pick up the hive-site.xml without needing to manually pass it as a parameter at runtime?

EDIT Naturally, since I'm on HDP I am using Ambari. The previous cluster admin has installed Spark2 clients on all of the machines, so all of the YARN NodeManagers that could be potential Spark drivers should have the same config files

I'm a bit slow so I did not think about that last year, but... _(a)_ `$HADOOP_CONF_DIR` may contain a **list** of entries, just like any CLASSPATH _(b)_ Spark also considers `$YARN_CONF_DIR` which may be a dirty workaround to inject Hive config _(c)_ the source code makes it clear that not everything in `$SPARK_CONF_DIR` is shipped to the YARN containers, cf. https://github.com/apache/spark/blob/branch-2.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala — Samson Scharfrichter, Apr 29 '18 at 21:23

score 6 · Accepted Answer · answered Sep 05 '17 at 14:30

6

You can use spark property - spark.yarn.dist.files and specify path to hive-site.xml there.

answered Sep 05 '17 at 14:30

Artur Sukhenko

602
3
12

That seems like a reasonable solution since that is just how `--files` works. Any drawbacks? – OneCricketeer Sep 05 '17 at 15:27
1

@cricket_007 That is how it is done in MapR distribution. You can check documentation here -http://maprdocs.mapr.com/home/Spark/IntegrateSparkSQL_Hive.html – Artur Sukhenko Sep 07 '17 at 14:52
Oh, nice! I never think to check MapR, only HDP and CDH – OneCricketeer Sep 07 '17 at 23:41

score 5 · Answer 2 · answered Sep 06 '17 at 16:40

The way I understand it, in local or yarn-client modes...

the Launcher checks whether it needs Kerberos tokens for HDFS, YARN, Hive, HBase
> hive-site.xml is searched in the CLASSPATH by the Hive/Hadoop client libs (including in driver.extraClassPath because the Driver runs inside the Launcher and the merged CLASSPATH is already built at this point)
the Driver checks which kind of metastore to use for internal purposes: a standalone metastore backed by a volatile Derby instance, or a regular Hive metastore
> that's $SPARK_CONF_DIR/hive-site.xml
when using the Hive interface, a Metastore connection is used to read/write Hive metadata in the Driver
> hive-site.xml is searched in the CLASSPATH by the Hive/Hadoop client libs (and the Kerberos token is used, if any)

So you can have one hive-site.xml stating that Spark should use an embedded, in-memory Derby instance to use as a sandbox (in-memory implying "stop leaving all these temp files behind you") while another hive-site.xml gives the actual Hive Metastore URI. And all is well.

Now, in yarn-cluster mode, all that mechanism pretty much explodes in a nasty, undocumented mess.

The Launcher needs its own CLASSPATH settings to create the Kerberos tokens, otherwise it fails silently. Better go to the source code to find out which undocumented Env variable you shoud use.
It may also need an override in some properties because the hard-coded defaults suddenly are not the defaults any more (silently).

The Driver cannot tap the original $SPARK_CONF_DIR, it has to rely on what the Launcher has made available for upload. Does that include a copy of $SPARK_CONF_DIR/hive-site.xml? Looks like it's not the case.
So you are probably using a Derby thing as a stub.

And the Driver has to to do with whatever YARN has forced on the container CLASSPATH, in whatever order.
Besides, the driver.extraClassPath additions do NOT take precedence by default; for that you have to force spark.yarn.user.classpath.first=true (which is translated to the standard Hadoop property whose exact name I can't remember right now, especially since there are multiple props with similar names that may be deprecated and/or not working in Hadoop 2.x)

Think that's bad? Try out connecting to a Kerberized HBase in yarn-cluster mode. The connection is done in the Executors, that's another layer of nastyness. But I disgress.

Bottom line: start your diagnostic again.

A. Are you really, really sure that the mysterious "Metastore connection errors" are caused by missing properties, and specifically the Metastore URI?

B. By the way, are your users explicitly using a HiveContext???

C. What is exactly the CLASSPATH that YARN presents to the Driver JVM, and what is exactly the CLASSPATH that the Driver presents to the Hadoop libs when opening the Metastore connection?

D. If the CLASSPATH built by YARN is messed up for some reason, what would be the minimal fix -- change in precedence rules? addition? both?

Thanks for the in-depth response. Luckily, not using Kerberos (at the moment). I'm just administering, so I can't control HiveContext vs SparkSession, but for the initial solution, I was telling people just set `hive.metastore.uris` in their code, and it worked fine, so **A**, yes. For the answer @Artur gave `spark.yarn.dist.files` does upload the file, as I expect. Just seems like a workaround IMO, but it solves **D**, although I understand what you mean by the precedence variable. — OneCricketeer, Sep 06 '17 at 17:24
On the whole, your idea of injecting `hive.metastore.uris` directly in the Hadoop properties could be a lesser evil. Especially if the value can be sourced automatically when building the `spark-submit` command line, somehow, and passed as an argument to the app. — Samson Scharfrichter, Sep 07 '17 at 22:05
I think these users are using Spring Boot to build their apps, so the configuration injection shouldn't be too difficult, it's just one thing that we'd like to avoid in case the HiveServer installation / FQDN changes. The hive-site.xml on the machine should be the source of truth — OneCricketeer, Sep 07 '17 at 23:37
Exactly, when saying _"sourced when building the command line"_ I was implying some kind of `sed` trick on `hive-site.xml`, at submit time... — Samson Scharfrichter, Sep 08 '17 at 06:33

score 1 · Answer 3 · answered Sep 03 '17 at 12:42

1

In the cluster mode configuration is read from the conf directory of the machine, which runs the driver container, not the one use for spark-submit.

answered Sep 03 '17 at 12:42

user8554772

11
1

That makes sense, so what's the solution? I use Ambari to distribute the configurations, and have Spark clients on every ResourceManager, and as you can see, the YARN container info isn't showing hive-site. – OneCricketeer Sep 03 '17 at 13:26
I meant NodeManager* – OneCricketeer Sep 04 '17 at 21:20

score 1 · Answer 4 · edited May 02 '18 at 20:52

1

Found an issue with this

You create a org.apache.spark.sql.SQLContext before creating hive context the hive-site.xml is not picked properly when you create hive context.

Solution : Create the hive context before creating another SQL context.

edited May 02 '18 at 20:52

OneCricketeer

179,855
19
132
245

answered May 02 '18 at 17:21

Naveen Krishnan

11
1

2

Actually, the bug was introduced in some 2.x version of Spark, where you should be using SparkSession instead of SQLContext anyway. I found this problem in JIRA. Just forget the ticket number. Plus, the HiveContext wraps a SQLContext, so making two of the same thing doesn't seem like a solution – OneCricketeer May 02 '18 at 20:51

Missing hive-site when using spark-submit YARN cluster mode

4 Answers4

Linked

Related