Starting a custom version of Spark on YARN in HDP works fine following the tutorial from https://georgheiler.com/2019/05/01/headless-spark-on-yarn/, i.e.:

# download a current headless version of spark
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HADOOP_CONF_DIR=/usr/hdp/current/spark2-client/conf
export SPARK_HOME=<<path/to>>/spark-2.4.3-bin-without-hadoop/
<<path/to>>/spark-2.4.3-bin-without-hadoop/bin/spark-shell --master yarn --deploy-mode client --queue <<my_queue>> --conf spark.driver.extraJavaOptions='-Dhdp.version=2.6.<<version>>' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=2.6.<<version>>'

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

However, a:

spark.sql("show databases").show

only returns:

+------------+
|databaseName|
+------------+
|     default|
+------------+

Now I am trying to pass the original HDP configuration (which apparently is not read by my custom version of Spark) via the following attempts:

one:

--files /usr/hdp/current/spark2-client/conf/hive-site.xml

two:

--conf spark.hive.metastore.uris='thrift://master001.my.corp.com:9083,thrift://master002.my.corp.com:9083,thrift://master003.my.corp.com:9083' --conf spark.hive.metastore.sasl.enabled='true' --conf hive.metastore.uris='thrift://master001.my.corp.com:9083,thrift://master002.my.corp.com:9083,thrift://master003.my.corp.com:9083' --conf hive.metastore.sasl.enabled='true'

three:

--conf spark.yarn.dist.files='/usr/hdp/current/spark2-client/conf/hive-site.xml'

four:

--conf spark.sql.warehouse.dir='/apps/hive/warehouse'

None of these attempts help to solve the issue. How can I get Spark to recognize the Hive databases?

Georg Heiler
  • Probably related to [this](https://stackoverflow.com/questions/32586793/howto-add-hive-properties-at-runtime-in-spark-shell/53581393#53581393) one – abiratsis May 26 '19 at 14:20
  • Finally answered in https://stackoverflow.com/questions/63668341/spark-3-x-on-hdp-3-1-in-headless-mode-with-hive-hive-tables-not-found with the pre-built jars including Hive – Georg Heiler Aug 31 '20 at 10:29

2 Answers

You can copy the hive-site.xml located in /usr/hdp/<hdp.version>/hive/conf or /opt/hdp/<hdp.version>/hive/conf, depending on where HDP is installed, into the conf directory of the headless Spark installation. When you restart the spark-shell, it should pick up this Hive configuration and load all the schemas present in Apache Hive.
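
As a minimal sketch of that copy (the <hdp.version> placeholder and the install paths are assumptions, to be adjusted to your environment):

# copy the cluster's Hive configuration into the headless Spark's conf directory
cp /usr/hdp/<hdp.version>/hive/conf/hive-site.xml <<path/to>>/spark-2.4.3-bin-without-hadoop/conf/

# then restart the shell so the configuration is picked up
<<path/to>>/spark-2.4.3-bin-without-hadoop/bin/spark-shell --master yarn --deploy-mode client --queue <<my_queue>>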

Yayati Sule
  • `/usr/hdp/current/spark2-client/conf/hive-site.xml` was already copied and also did not work. Now following your suggestion and using `/usr/hdp/current/hive-client/conf/hive-site.xml`, it fails as mentioned above. – Georg Heiler May 21 '19 at 08:49
  • You need to copy the `hive-site.xml` file from the **hive/conf** folder which is located under the folder named **2.6.5.xxx** inside `/usr/hdp/`. For my installation, the path is as follows: `/usr/hdp/2.6.5.0-292/hive/conf/` – Yayati Sule May 21 '19 at 08:52
  • `/usr/hdp/current/hive-client` is a softlink to the `/usr/hdp/<>/hive/conf/`, but as mentioned above somehow the wrong configuration is loaded. My destination for the copy is: `<>/spark-2.4.3-bin-without-hadoop/conf` – Georg Heiler May 21 '19 at 08:55
  • Do you have **spark-thrift-sparkconf.conf** in your `<>/spark-2.4.3-bin-without-hadoop/conf` directory? This file is generated by Apache Ambari for Apache Spark2 bundled with your HDP distribution. – Yayati Sule May 21 '19 at 09:23
  • No, but this is also not in the original: `ls /usr/hdp/current/spark2-client/conf` path – Georg Heiler May 21 '19 at 09:29
  • The File I mentioned in my last comment resides in `/usr/hdp/2.6.5.0-xxx/spark2/conf` directory. The File is generated by Ambari when we are installing the HDP distribution. You should be looking at `/usr/hdp/2.6.5.0-xxx/spark2/conf` instead of `/usr/hdp/current/spark2-client/conf`. – Yayati Sule May 21 '19 at 09:31
  • Even copying `spark-defaults.conf spark-env.sh` did not fix the issue, and even in `/usr/hdp/2.6.5.0-xxx/spark2/conf` there is no such file for me. However, I can find this file on a different cluster node. **But even with this file** it still fails to find the databases. – Georg Heiler May 21 '19 at 09:35
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/193684/discussion-between-yayati-sule-and-georg-heiler). – Yayati Sule May 21 '19 at 10:07

Hive jars need to be on Spark's classpath for Hive support to be enabled. If the Hive jars are not present on the classpath, the catalog implementation used is in-memory.
In spark-shell we can confirm this by executing

sc.getConf.get("spark.sql.catalogImplementation") 

which will give in-memory
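
A quick non-interactive way to run that check (a sketch only; it pipes the statement into the shell and reuses the placeholder path from the question):

# prints "in-memory" when the Hive jars are missing from the classpath, "hive" when Hive support is enabled
# (local mode is sufficient for this check, so no YARN options are passed)
echo 'println(sc.getConf.get("spark.sql.catalogImplementation"))' | <<path/to>>/spark-2.4.3-bin-without-hadoop/bin/spark-shell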

Why are Hive classes required?

    def enableHiveSupport(): Builder = synchronized {
      if (hiveClassesArePresent) {
        config(CATALOG_IMPLEMENTATION.key, "hive")
      } else {
        throw new IllegalArgumentException(
          "Unable to instantiate SparkSession with Hive support because " +
            "Hive classes are not found.")
      }
    }

SparkSession.scala

  private[spark] def hiveClassesArePresent: Boolean = {
    try {
      Utils.classForName(HIVE_SESSION_STATE_BUILDER_CLASS_NAME)
      Utils.classForName("org.apache.hadoop.hive.conf.HiveConf")
      true
    } catch {
      case _: ClassNotFoundException | _: NoClassDefFoundError => false
    }
  }

If the classes are not present, Hive support is not enabled. Link to the code where the above checks happen as part of Spark shell initialization.

In the code pasted as part of the question, SPARK_DIST_CLASSPATH is populated only with the Hadoop classpath, and the paths to the Hive jars are missing.
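
A hedged sketch of what supplying them at launch time could look like, reusing the command from the question (the /usr/hdp/current/hive-client/lib path is taken from the comment discussion below and is an assumption; note the trailing /*, and be aware that this can pull in conflicting jars such as jline, as the comments show):

export SPARK_DIST_CLASSPATH=$(hadoop classpath)
# prepend the HDP Hive jars so the Hive classes are found at runtime (assumed location)
export SPARK_DIST_CLASSPATH="/usr/hdp/current/hive-client/lib/*:${SPARK_DIST_CLASSPATH}"
export HADOOP_CONF_DIR=/usr/hdp/current/spark2-client/conf
export SPARK_HOME=<<path/to>>/spark-2.4.3-bin-without-hadoop/
<<path/to>>/spark-2.4.3-bin-without-hadoop/bin/spark-shell --master yarn --deploy-mode client --queue <<my_queue>> --conf spark.driver.extraJavaOptions='-Dhdp.version=2.6.<<version>>' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=2.6.<<version>>'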

DaRkMaN
  • But isn't Spark compiled with Hive support by default? – Georg Heiler May 26 '19 at 11:59
  • Spark is compiled against Hive, but the Hive jars are not part of the classpath at runtime. You can either make them part of the classpath, set `spark.sql.hive.metastore.jars`, or use Spark compiled with the `-Phive` profile (https://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support), so that the Hive jars are available in the Spark assembly directory. – DaRkMaN May 26 '19 at 12:20
  • If we are using the distribution mentioned in the blog (https://spark.apache.org/docs/latest/hadoop-provided.html), it does not have the Hive or Hadoop jars, so they have to be supplied at runtime. – DaRkMaN May 26 '19 at 12:29
  • That sounds really promising. I will test it in a minute. Meanwhile: do you add the full contents of `/usr/hdp/current/hive-client/lib`? Or only parts of it? – Georg Heiler May 26 '19 at 15:59
  • We need to add the full path of the lib folder; I am not sure which directory has the Hive jars in the HDP distribution though. – DaRkMaN May 26 '19 at 16:02
  • But I am still having the same problem: `export SPARK_DIST_CLASSPATH="/usr/hdp/current/hive-client/lib*:${SPARK_DIST_CLASSPATH}"` – Georg Heiler May 26 '19 at 16:06
  • Doesn't solve it, and `echo $SPARK_DIST_CLASSPATH` shows that `/usr/hdp/current/hive-client/lib*:/usr/hdp/<>/hadoop/conf:/usr/hdp/<>/hadoop/lib/*:/usr/hdp/<>/hadoop/.//*:/usr/hdp/<>/hadoop-hdfs/./:/usr/hdp/<>/hadoop-hdfs/lib/*:/usr/hdp/<>/hadoop-hdfs/.//*:/usr/hdp/<>/hadoop-yarn/lib/*:/usr/hdp/<>/hadoop-yarn/.//*:/usr/hdp/<>/hadoop-mapreduce/lib/*:/usr/hdp/<>/hadoop-mapreduce/.//*:/usr/hdp/<>/tez/*:/usr/hdp/<>/tez/lib/*:/usr/hdp/<>/tez/conf` is loaded – Georg Heiler May 26 '19 at 16:06
  • The Hive classpath seems wrong; can we add a forward slash after `lib`, followed by `*`, and give it a try? And can we also check the value of `sc.getConf.get("spark.sql.catalogImplementation")` after the shell starts? – DaRkMaN May 26 '19 at 16:10
  • Correct. Fixing the path now leads to loading the jars, but then I am in classpath hell. `java.lang.NoSuchMethodError: jline.console.completer.CandidateListCompletionHandler.setPrintSpaceAfterFullCompletion(Z)V at scala.tools.nsc.interpreter.jline.JLineConsoleReader.initCompletion(JLineReader.scala:139) at scala.tools.nsc.interpreter.jline.InteractiveReader.postInit(JLineReader.scala:54) at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$1.apply(SparkILoop.scala:190)` – Georg Heiler May 26 '19 at 16:31
  • Trying `export SPARK_DIST_CLASSPATH=$(hadoop classpath); cp -R /usr/hdp/current/hive-client/lib/*hive* .; export SPARK_DIST_CLASSPATH="/path/to/hivestuff/*:${SPARK_DIST_CLASSPATH}"` again does not return the databases from Hive. – Georg Heiler May 26 '19 at 16:43
  • But maybe I am missing some jars. I will try your link later. – Georg Heiler May 26 '19 at 16:43
  • Ok, I also tried to manually put the 2.14.6 jline jar into Spark's lib folder and set the Hive classpath - it does not help. Only an in-memory Hive metastore is loaded. – Georg Heiler May 26 '19 at 19:38
  • A quick way to check whether our workaround works is to start spark-sql instead of spark-shell (the latter has an issue in versions > 2.4.0), after adding only the Hive jars to the classpath, and list the tables. – DaRkMaN May 26 '19 at 23:16
  • Interesting: `/path/to/spark-2.4.3-bin-without-hadoop/bin/spark-sql --conf spark.driver.extraJavaOptions='-Dhdp.version=2.6.<>' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=2.6.<>'` fails with `ClassNotFoundException: org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver` – Georg Heiler May 27 '19 at 07:01
  • I think the above is also a known issue. Basically we need an environment which does not have issues relating to Hive. The only other options left are the `pyspark` shell and the `SparkR` shell. – DaRkMaN May 27 '19 at 07:20