Update2
Great thanks to this thread; it saved my day.
For the exception thrown in Update1, add the configuration below:
<property>
<name>fs.AbstractFileSystem.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3A</value>
<description>The implementation class of the S3A AbstractFileSystem.</description>
</property>
As per the explanation in the answer, please also upgrade Hadoop to at least 2.8.0, since the org.apache.hadoop.fs.s3a.S3A class referenced above does not exist in earlier releases (which is exactly the ClassNotFoundException from Update1).
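If editing core-site.xml on every node is inconvenient, the same properties can also be passed at submit time through Spark's spark.hadoop.* passthrough, which copies such keys into the Hadoop Configuration. A sketch (values taken from the properties above; adjust to your deployment):

```shell
spark-sql --master=yarn \
  --conf spark.hadoop.fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
```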
I'll leave this thread here for others who run into the same situation.
Update1
According to a similar issue listed at Hortonworks, I realized I also needed to specify fs.s3a.impl explicitly, so I added the settings below to $HADOOP_HOME/etc/hadoop/core-site.xml:
<property>
<name>fs.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
<description>The implementation class of the S3A Filesystem</description>
</property>
After that, the previous error disappeared and a new exception was thrown:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3A not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:158)
Working it out.
According to my current understanding, the UnsupportedFileSystemException is thrown because Hadoop doesn't "know" that the file system exists. By specifying the implementation class of the file system, either in a config file or at runtime, the issue can be solved.
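As a quick sanity check that the scheme is now resolvable, one can list a path directly with the Hadoop CLI; the bucket name below is a placeholder, and the S3A credentials/endpoint must already be configured:

```shell
# If the implementation classes are wired up correctly, this should list
# the bucket instead of throwing UnsupportedFileSystemException.
hadoop fs -ls s3a://some-bucket/
```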
Original
I'm using Spark SQL to query data stored in Ceph.
Currently, my software stack is Hadoop 2.7.3, Spark 2.3.0, and Ceph Luminous.
According to existing threads such as hadoop-2-9-2-spark-2-4-0-access-aws-s3a-bucket, the versions of Hadoop, the AWS-related library, and Spark are critical. I'm using hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar, as the solution mentioned. By putting those jars into $SPARK_HOME/jars, I solved the HIVE_STATS_JDBC_TIMEOUT issue; when I run spark-sql in a shell, everything works fine so far.
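For reference, the copy step is just the following (assuming the two jars were downloaded to the current directory; filenames as above):

```shell
cp hadoop-aws-2.7.3.jar aws-java-sdk-1.7.4.jar "$SPARK_HOME/jars/"
```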
However, I got stuck again when I tried to invoke spark-sql on top of YARN.
The command I used is spark-sql --master=yarn, and the exception thrown is:
org.apache.hadoop.fs.UnsupportedFileSystemException: fs.AbstractFileSystem.s3a.impl=null: No AbstractFileSystem configured for scheme: s3a
at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:160)
at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:249)
To solve that, I tried putting hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar on the Hadoop classpath, e.g. under $HADOOP_HOME/share/hadoop/common/lib.
However, the error remained exactly the same.
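Since --master=yarn executes tasks in YARN containers, the jars must be visible there as well; an alternative to patching the Hadoop classpath is to ship them with the job via --jars (a sketch; the paths are assumptions based on where the jars were placed earlier):

```shell
spark-sql --master=yarn \
  --jars "$SPARK_HOME/jars/hadoop-aws-2.7.3.jar,$SPARK_HOME/jars/aws-java-sdk-1.7.4.jar"
```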
Maybe I need to change the version of Spark? I've racked my brain over this. Any comments are welcome; thanks in advance for your help.