
I'm using Spark v1.6.1 and Hive v1.2.x with Python v2.7.

For Hive, I have some tables (ORC files) stored in HDFS and some stored in S3. If I try to join two tables, one in HDFS and the other in S3, a java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found is thrown.

For example, this works when querying a Hive table in HDFS.

df1 = sqlContext.sql('select * from hdfs_db.tbl1')

This also works when querying a Hive table in S3.

df2 = sqlContext.sql('select * from s3_db.tbl2')

But the code below throws the RuntimeException above.

sql = """
select *
from hdfs_db.tbl1 a
 join s3_db.tbl2 b on a.id = b.id
"""
df3 = sqlContext.sql(sql)

We are migrating from HDFS to S3, which is why the Hive tables are backed by different storage (ORC files in HDFS for some tables, in S3 for others). Interestingly, if we connect to Hive with DBeaver or the beeline client and issue the joined query, it works. I can also issue the joined query through sqlalchemy and get a result. The problem only shows up with Spark's sqlContext.
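
For reference, the sqlalchemy route looks roughly like the sketch below. This is a minimal sketch, assuming the PyHive dialect is installed and HiveServer2 is reachable; the host, port, and database are placeholders.

# Minimal sketch of the sqlalchemy route; assumes the PyHive dialect is
# installed and HiveServer2 is reachable (host, port, database are placeholders).
from sqlalchemy import create_engine

engine = create_engine('hive://edge-node:10000/default')
rows = engine.execute(
    'select * from hdfs_db.tbl1 a join s3_db.tbl2 b on a.id = b.id'
).fetchall()

This path goes through HiveServer2, so Hive itself plans and runs the join, which is consistent with beeline and DBeaver working while Spark fails.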

More information on execution and environment: this code is executed in a Jupyter notebook on an edge node (which already has Spark, Hadoop, Hive, Tez, etc. set up and configured). The Python environment is managed by conda for Python v2.7. Jupyter is started with pyspark as follows.

IPYTHON_OPTS="notebook --port 7005 --notebook-dir='~/' --ip='*' --no-browser" \
 pyspark \
 --queue default \
 --master yarn-client

When I go to the Spark Application UI, under Environment, the Classpath Entries are as follows.

  • /usr/hdp/2.4.2.0-258/spark/lib/datanucleus-api-jdo-3.2.6.jar
  • /usr/hdp/2.4.2.0-258/spark/lib/datanucleus-core-3.2.10.jar
  • /usr/hdp/2.4.2.0-258/spark/lib/datanucleus-rdbms-3.2.9.jar
  • /usr/hdp/2.4.2.0-258/spark/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar
  • /usr/hdp/current/hadoop-client/conf/
  • /usr/hdp/current/spark-historyserver/conf/

The sun.boot.class.path has the following value: /usr/jdk64/jdk1.8.0_60/jre/lib/resources.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/rt.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/sunrsasign.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/jsse.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/jce.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/charsets.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/jfr.jar:/usr/jdk64/jdk1.8.0_60/jre/classes.

The spark.executorEnv.PYTHONPATH has the following value: /usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip:/usr/hdp/2.4.2.0-258/spark/python/:<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.9-src.zip.
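
A quick way to confirm the class really is missing from the driver JVM is to ask it directly through the py4j gateway that pyspark exposes as sc._jvm; a minimal diagnostic sketch:

# Hypothetical diagnostic: ask the driver JVM whether the S3A filesystem
# class is visible on its classpath.
from py4j.protocol import Py4JJavaError

try:
    sc._jvm.java.lang.Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")
    print("S3AFileSystem is on the driver classpath")
except Py4JJavaError as e:
    print("not found: %s" % e)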

The Hadoop distribution is via HDP (note the /usr/hdp paths above): Hadoop 2.7.1.2.4.2.0-258
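
The build can also be confirmed from the notebook with a one-liner that calls Hadoop's VersionInfo class through the same py4j gateway:

# Print the Hadoop version the driver JVM actually linked against.
print(sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion())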

Jane Wayne

1 Answer


Quoting Steve Loughran (who, given his track record in Spark development, seems to be the source of truth on the topic of accessing S3 file systems) from SPARK-15965 No FileSystem for scheme: s3n or s3a spark-2.0.0 and spark-1.6.1:

This is being fixed with tests in my work in SPARK-7481; the manual workaround is

Spark 1.6+: This needs my patch and a rebuild of the spark assembly. However, once that patch is in, trying to use the assembly without the AWS JARs will stop Spark from starting, unless you move up to Hadoop 2.7.3

There are also some other sources where you can find workarounds.

I've got no environment (or experience) to give the above a shot, so once you've tried it, please report back so we get a better understanding of the current situation regarding S3 support in Spark. Thanks.
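
For completeness, per the comment below, a Spark 1.6 build needs a matching hadoop-aws JAR and its AWS SDK dependency on the classpath (Hadoop 2.7.x pairs with aws-java-sdk 1.7.4). A minimal sketch of relaunching the notebook with them follows; the JAR paths are assumptions for an HDP 2.4.2 layout, so locate the real files first, e.g. with find /usr/hdp -name 'hadoop-aws*.jar'.

# Same launch command as in the question, with the two extra JARs appended.
# The JAR paths below are assumptions for an HDP 2.4.2 layout; verify them first.
IPYTHON_OPTS="notebook --port 7005 --notebook-dir='~/' --ip='*' --no-browser" \
 pyspark \
 --queue default \
 --master yarn-client \
 --jars /usr/hdp/2.4.2.0-258/hadoop/hadoop-aws-2.7.1.2.4.2.0-258.jar,/usr/hdp/2.4.2.0-258/hadoop/lib/aws-java-sdk-1.7.4.jar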

Jacek Laskowski
    I like to consider myself "source of consistent claims" rather than truth: the latter is held only by the tests. The spark hadoop-cloud module is in Spark 2.3, though not all builds are set up with -Phadoop-cloud. For Spark 1.6 you will need a matching version of hadoop-aws and amazon-s3-sdk on the CP. I don't know what CDH does there, sorry. – stevel May 19 '17 at 10:48