
I am trying to query Hive tables from Spark 2.2.1 by creating a HiveContext. Whether I submit my jobs via spark-submit or run them in the pyspark shell, the effect is the same: Spark works, but it can only see the default database in Hive and cannot see any others. This problem seems to have been known for some time, and all the advice I found is about adjusting Spark parameters such as --deploy-mode and --master and passing the hive-site.xml file to Spark explicitly.
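
For context, the relevant part of myscript.py boils down to something like this (a minimal sketch, not my exact script; the app name is illustrative):

# build a HiveContext and list the databases Spark can see
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-db-check")  # illustrative name
hc = HiveContext(sc)  # deprecated in Spark 2.x in favour of SparkSession, but still available
hc.sql("SHOW DATABASES").show()  # only 'default' comes back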

After reading everything I could find on this problem, I changed the spark-submit command to the following:

/bin/spark-submit --driver-class-path /opt/sqljdbc_6.0/sqljdbc_6.0/enu/jre8/sqljdbc42.jar --deploy-mode cluster --files /usr/hdp/current/spark2-client/conf/hive-site.xml --master yarn /home/konstantin/myscript.py

(the --driver-class-path argument is for querying an MSSQL database within the script; this is not relevant to the problem).

Once I run this command, I get the following error:

18/02/22 19:23:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/02/22 19:23:45 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Exception in thread "main" java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
    at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createTimelineClient(YarnClientImpl.java:181)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:168)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:152)
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1109)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1168)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 17 more

Process finished with exit code 0

According to the advice I found here, I downloaded jersey-bundle-1.17.1.jar, put it on the local file system, and passed it to spark-submit with the --jars option:

/bin/spark-submit --driver-class-path /opt/sqljdbc_6.0/sqljdbc_6.0/enu/jre8/sqljdbc42.jar --jars /home/konstantin/jersey-bundle-1.17.1.jar --deploy-mode cluster --files /usr/hdp/current/spark2-client/conf/hive-site.xml --master yarn /home/konstantin/myscript.py

This had no effect; I am still getting the same NoClassDefFoundError as above. As a result, I cannot evaluate the older solutions to the initial problem (Spark not seeing Hive databases), since I am stuck on this error.
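
One thing worth ruling out is a corrupted download, so a quick sanity check (using the same jar path as above) is to confirm that the missing class is actually inside the bundle:

# quick check: is com.sun.jersey.api.client.config.ClientConfig inside the jar?
import zipfile

jar = zipfile.ZipFile("/home/konstantin/jersey-bundle-1.17.1.jar")
print(any(name.endswith("com/sun/jersey/api/client/config/ClientConfig.class")
          for name in jar.namelist()))  # True if the class is present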

I will appreciate any suggestions.

  • Your question should be much more targeted: is the problem specific to `yarn-cluster` mode? If yes, check https://stackoverflow.com/questions/45477155/hive-site-missing-when-using-spark-submit-yarn-cluster-mode – Samson Scharfrichter Feb 22 '18 at 20:26
  • What do you mean exactly by "cannot see any DB other than `default`"? Do you see the expected tables in `default`, or is it empty - implying that actually you are not connected to the Metastore, hence Spark uses an embedded Derby DB to emulate the Hive metastore? – Samson Scharfrichter Feb 22 '18 at 20:29
  • Your error has nothing to do with Hive, but with the driver or executor classpath. For example, HDP Spark2 is not Spark 2.2.1, and you forgot `--jars` to pass the JAR file to the executors – OneCricketeer Feb 23 '18 at 03:16
  • Experiencing similar issues with pyspark on hortonworks. `spark.sql("show databases").show()` only returns **default** – zar3bski Sep 02 '19 at 18:53

1 Answer


Please check in the YARN logs what the spark.hive.warehouse property is set to. If it is nil, then your hive-site.xml is not being distributed properly.

The problem is mostly due to hive-site.xml. Please check in the Spark UI's Environment tab whether the file is being distributed properly.
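
If you prefer to check from inside the job itself, a rough sketch like the following (property names can vary between distributions) prints the settings Spark actually resolved:

# print the catalog/warehouse settings the running job actually sees
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
for key in ("spark.sql.catalogImplementation", "spark.sql.warehouse.dir"):
    print(key, "=", spark.conf.get(key, "<not set>"))
# 'in-memory' instead of 'hive' means hive-site.xml was never picked up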
