I'm attempting to run a pyspark script on BigInsights on Cloud 4.2 Enterprise that accesses a Hive table.

First, I create the Hive table:

[biadmin@bi4c-xxxxx-mastermanager ~]$ hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 2.147 seconds
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loading data to table default.pokes
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0]
OK
Time taken: 0.49 seconds
hive> 

Then I create a simple pyspark script:

[biadmin@bi4c-xxxxxx-mastermanager ~]$ cat test_pokes.py
from pyspark import SparkContext

sc = SparkContext()

from pyspark.sql import HiveContext
hc = HiveContext(sc)

pokesRdd = hc.sql('select * from pokes')
print( pokesRdd.collect() )

I attempt to execute with:

[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit \
    --master yarn-cluster \
    --deploy-mode cluster \
    --jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
    test_pokes.py

However, I encounter the error:

Traceback (most recent call last):
  File "test_pokes.py", line 8, in <module>
    pokesRdd = hc.sql('select * from pokes')
  File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql
  File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/pyspark.zip/pyspark/sql/utils.py", line 51, in deco
pyspark.sql.utils.AnalysisException: u'Table not found: pokes; line 1 pos 14'
End of LogType:stdout

If I run spark-submit standalone (with the default local master), I can see that the table exists:

[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit test_pokes.py
…
…
16/12/21 13:09:13 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 18962 bytes result sent to driver
16/12/21 13:09:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 168 ms on localhost (1/1)
16/12/21 13:09:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/12/21 13:09:13 INFO DAGScheduler: ResultStage 0 (collect at /home/biadmin/test_pokes.py:9) finished in 0.179 s
16/12/21 13:09:13 INFO DAGScheduler: Job 0 finished: collect at /home/biadmin/test_pokes.py:9, took 0.236558 s
[Row(foo=238, bar=u'val_238'), Row(foo=86, bar=u'val_86'), Row(foo=311, bar=u'val_311')
…
…

See my previous question related to this issue: hive spark yarn-cluster job fails with: "ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory"

This question is similar to this other question: Spark can access Hive table from pyspark but not from spark-submit. However, unlike that question, I am using HiveContext.


Update: see here for the final solution https://stackoverflow.com/a/41272260/1033422

Chris Snow

3 Answers


This is because the spark-submit job is unable to find the hive-site.xml, so it cannot connect to the Hive metastore. Please add --files /usr/iop/4.2.0.0/hive/conf/hive-site.xml to your spark-submit command.
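
For example, here is a minimal sketch of the question's command with the --files option added (the paths are the ones from the question; adjust them for your own cluster):

spark-submit \
    --master yarn-cluster \
    --deploy-mode cluster \
    --files /usr/iop/4.2.0.0/hive/conf/hive-site.xml \
    --jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
    test_pokes.py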

Fokko Driesprong
  • That doesn't really explain why it works in standalone mode – OneCricketeer Dec 21 '16 at 13:28
  • This gets me one step further. I now receive the error: `MetaException(message:Failed to instantiate listener named: com.ibm.biginsights.bigsql.sync.BIEventListener, reason: java.lang.ClassNotFoundException: com.ibm.biginsights.bigsql.sync.BIEventListener)` – Chris Snow Dec 21 '16 at 13:32
  • Sorry, I should explain. If you run standalone, the driver runs on the machine itself, so it will pick up the `hive-site.xml` from your local classpath. If you run in cluster mode, this xml file is not transferred to the container on the cluster, so you have to specify it by hand and Spark will put it in the classpath for you (see the sketch after these comments). – Fokko Driesprong Dec 21 '16 at 13:33
  • I don't recognise the error, but that is most likely because I have very little experience with BigInsights. I would say that you are missing a jar file in the classpath of the application, but I don't know which jar is required. – Fokko Driesprong Dec 21 '16 at 13:34
  • Thanks - I'll accept this answer and raise that as another question. – Chris Snow Dec 21 '16 at 13:34
  • The other question with the new error is here: http://stackoverflow.com/questions/41264229/spark-hive-reporting-classnotfoundexception-com-ibm-biginsights-bigsql-sync-bie?noredirect=1&lq=1 – Chris Snow Dec 21 '16 at 15:16
  • You are awesome. Thank you. – kfkhalili Jun 23 '17 at 12:43
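
To illustrate the difference Fokko describes, here is a sketch using the paths from the question (the behaviour in client mode matches the test matrix in the next answer):

# client mode: the driver runs on the submitting machine and picks up
# hive-site.xml from the local classpath, so no extra option is needed
spark-submit --master yarn --deploy-mode client test_pokes.py

# cluster mode: the driver runs inside a YARN container, so hive-site.xml
# must be shipped to it explicitly
spark-submit --master yarn --deploy-mode cluster \
    --files /usr/iop/4.2.0.0/hive/conf/hive-site.xml \
    test_pokes.py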

It looks like you are affected by this bug: https://issues.apache.org/jira/browse/SPARK-15345.

I had a similar issue with Spark 1.6.2 and 2.0.0 on HDP-2.5.0.0:
My goal was to create a DataFrame from a Hive SQL query, under these conditions:

  • Python API,
  • cluster deploy mode (driver program running on one of the executor nodes),
  • YARN to manage the executor JVMs (instead of a standalone Spark master instance).

The initial tests gave these results:

  1. spark-submit --deploy-mode client --master local ... => WORKING
  2. spark-submit --deploy-mode client --master yarn ... => WORKING
  3. spark-submit --deploy-mode cluster --master yarn ... => NOT WORKING

In case #3, the driver running on one of the executor nodes could not find the database. The error was:

pyspark.sql.utils.AnalysisException: 'Table or view not found: `database_name`.`table_name`; line 1 pos 14'

Fokko Driesprong's answer above worked for me.
With the command below, the driver running on the executor node was able to access a Hive table in a database other than default:

$ /usr/hdp/current/spark2-client/bin/spark-submit \
--deploy-mode cluster --master yarn \
--files /usr/hdp/current/spark2-client/conf/hive-site.xml \
/path/to/python/code.py

The Python code I used to test with both Spark 1.6.2 and Spark 2.0.0 is shown below (change SPARK_VERSION to 1 to test with Spark 1.6.2, and update the paths in the spark-submit command accordingly):

SPARK_VERSION = 2
APP_NAME = 'spark-sql-python-test_SV,' + str(SPARK_VERSION)


def spark1():
    # Spark 1.6.x: HiveContext on top of a plain SparkContext
    from pyspark.sql import HiveContext
    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName(APP_NAME)
    sc = SparkContext(conf=conf)
    hc = HiveContext(sc)

    query = 'select * from database_name.table_name limit 5'
    df = hc.sql(query)
    printout(df)


def spark2():
    # Spark 2.x: SparkSession with Hive support enabled
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName(APP_NAME).enableHiveSupport().getOrCreate()
    query = 'select * from database_name.table_name limit 5'
    df = spark.sql(query)
    printout(df)


def printout(df):
    # dump the DataFrame several ways so failures are easy to spot in the YARN logs
    print('\n########################################################################')
    df.show()
    print(df.count())

    df_list = df.collect()
    print(df_list)
    print(df_list[0])
    print(df_list[1])
    print('########################################################################\n')


def main():
    if SPARK_VERSION == 1:
        spark1()
    elif SPARK_VERSION == 2:
        spark2()


if __name__ == '__main__':
    main()
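
The spark-submit command shown above is for Spark 2. For Spark 1.6.2 on the same HDP cluster, the analogous command would presumably use the Spark 1 client paths. This is a sketch: /usr/hdp/current/spark-client is an assumption based on the standard HDP layout, so verify the paths on your own cluster:

$ /usr/hdp/current/spark-client/bin/spark-submit \
--deploy-mode cluster --master yarn \
--files /usr/hdp/current/spark-client/conf/hive-site.xml \
/path/to/python/code.py
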
Raphvanns

For me the accepted answer (--files /usr/iop/4.2.0.0/hive/conf/hive-site.xml) did not work.

Adding the code below at the top of the script solved it:

import findspark
findspark.init('/usr/share/spark-2.4')  # for 2.4
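
For context, here is a minimal sketch of how that fits into a complete script. Two assumptions: findspark.init must run before any pyspark import, and '/usr/share/spark-2.4' stands in for wherever Spark is actually installed on your machine:

import findspark
findspark.init('/usr/share/spark-2.4')  # must run before any pyspark import

# import pyspark only after findspark has put it on sys.path
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
print(spark.sql('select * from pokes').collect())
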
Chris