I use Spark v1.6.1 and Hive v1.2.x with Python v2.7
For Hive, I have some tables (ORC files) stored in HDFS and some stored in S3. If we try to join two tables, where one is in HDFS and the other is in S3, the following exception is thrown:

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
For example, this works when querying against a Hive table in HDFS:
df1 = sqlContext.sql('select * from hdfs_db.tbl1')
This works when querying against a Hive table in S3:
df2 = sqlContext.sql('select * from s3_db.tbl2')
The code below, however, throws the RuntimeException above:
sql = """
select *
from hdfs_db.tbl1 a
join s3_db.tbl2 b on a.id = b.id
"""
df3 = sqlContext.sql(sql)
We are migrating from HDFS to S3, which is why the Hive tables have different storage backings (basically, ORC files in HDFS and in S3). One interesting thing is that if we use DBeaver or the beeline client to connect to Hive and issue the joined query, it works. I can also use sqlalchemy to issue the joined query and get the result. This problem only shows up with Spark's sqlContext.
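For reference, this is roughly how the joined query is issued via sqlalchemy (a minimal sketch; the PyHive dialect and the host/port in the connection string are assumptions, not the actual values from our cluster):

from sqlalchemy import create_engine  # assumes the PyHive dialect is installed

# Hypothetical HiveServer2 endpoint; the real host/port are not shown here.
engine = create_engine('hive://hive-server:10000/default')
sql = """
select *
from hdfs_db.tbl1 a
join s3_db.tbl2 b on a.id = b.id
"""
rows = engine.execute(sql).fetchall()  # the joined query returns rows without error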
More information on the execution environment: this code runs in a Jupyter notebook on an edge node (which already has Spark, Hadoop, Hive, Tez, etc. set up and configured). The Python environment is managed by conda for Python v2.7. Jupyter is started with pyspark as follows.
IPYTHON_OPTS="notebook --port 7005 --notebook-dir='~/' --ip='*' --no-browser" \
pyspark \
--queue default \
--master yarn-client
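Inside the notebook, pyspark provides sc and sqlContext automatically. A quick sanity check looks like this (a minimal sketch; the expected values in the comments are assumptions based on the versions above, not captured output):

# Sanity check inside the notebook.
print(sc.version)         # expected: 1.6.1
print(type(sqlContext))   # expected: pyspark.sql.context.HiveContext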
When I go to the Spark application UI under Environment, the Classpath Entries section shows the following.
- /usr/hdp/2.4.2.0-258/spark/lib/datanucleus-api-jdo-3.2.6.jar
- /usr/hdp/2.4.2.0-258/spark/lib/datanucleus-core-3.2.10.jar
- /usr/hdp/2.4.2.0-258/spark/lib/datanucleus-rdbms-3.2.9.jar
- /usr/hdp/2.4.2.0-258/spark/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar
- /usr/hdp/current/hadoop-client/conf/
- /usr/hdp/current/spark-historyserver/conf/
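As a diagnostic (my own sketch, not output from the UI), the driver classpath can also be probed from the notebook through py4j to see whether the S3A class is loadable:

# Ask the driver JVM (via py4j) whether the S3A filesystem class can be loaded.
# Diagnostic sketch only; on failure py4j raises a wrapped Java exception.
try:
    sc._jvm.java.lang.Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")
    print("S3AFileSystem is visible to the driver JVM")
except Exception as e:
    print("S3AFileSystem not found on the driver classpath: %s" % e)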
The sun.boot.class.path has the following value:

/usr/jdk64/jdk1.8.0_60/jre/lib/resources.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/rt.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/sunrsasign.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/jsse.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/jce.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/charsets.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/jfr.jar:/usr/jdk64/jdk1.8.0_60/jre/classes
The spark.executorEnv.PYTHONPATH has the following value:

/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip:/usr/hdp/2.4.2.0-258/spark/python/:<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.9-src.zip
The Hadoop distribution is via HDP (note the /usr/hdp/2.4.2.0-258 paths above): Hadoop 2.7.1.2.4.2.0-258
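For completeness, the S3A wiring in the Hadoop configuration that Spark picks up can be inspected from the notebook as well (again a sketch of a diagnostic I would run; the property names are the standard fs.s3a.* keys, and the expected value in the comment is an assumption):

# Inspect the Hadoop configuration used by Spark. fs.s3a.impl normally maps to
# org.apache.hadoop.fs.s3a.S3AFileSystem for s3a:// paths to resolve.
hadoop_conf = sc._jsc.hadoopConfiguration()
print(hadoop_conf.get("fs.s3a.impl"))
print(hadoop_conf.get("fs.s3a.access.key") is not None)  # True if credentials are configured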