Update2
Great thanks to this thread; it saved my day.
For the exception thrown in Update1, add the configuration below:
<property>
<name>fs.AbstractFileSystem.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3A</value>
<description>The implementation class of the S3A AbstractFileSystem.</description>
</property>
As per the explanation in the answer, please also upgrade Hadoop to at least 2.8.0, since the org.apache.hadoop.fs.s3a.S3A class referenced above does not exist in earlier releases (which is exactly the ClassNotFoundException from Update1).
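If editing core-site.xml on every node is inconvenient, the same properties can also be passed at submit time through Spark's spark.hadoop.* passthrough, which copies such keys into the Hadoop Configuration. A sketch (values taken from the properties above; adjust to your deployment):

```shell
spark-sql --master=yarn \
  --conf spark.hadoop.fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
```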
I'll leave this thread here for others who run into the same situation.
Update1
According to a similar issue listed at Hortonworks, I realized I also needed to specify fs.s3a.impl explicitly, so I added the settings below to $HADOOP_HOME/etc/hadoop/core-site.xml:
<property>
<name>fs.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
<description>The implementation class of the S3A Filesystem</description>
</property>
After that, the previous error disappeared and a new exception was thrown:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3A not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:158)
Working it out.
According to my current understanding, the UnsupportedFileSystemException is thrown because Hadoop doesn't "know" that the file system exists. By specifying the implementation class of the file system, either in a config file or at runtime, the issue can be solved.
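As a quick sanity check that the scheme is now resolvable, one can list a path directly with the Hadoop CLI; the bucket name below is a placeholder, and the S3A credentials/endpoint must already be configured:

```shell
# If the implementation classes are wired up correctly, this should list
# the bucket instead of throwing UnsupportedFileSystemException.
hadoop fs -ls s3a://some-bucket/
```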
Original
I'm using Spark SQL to query data stored in Ceph.
Currently, my software stack is Hadoop 2.7.3, Spark 2.3.0, and Ceph Luminous.
According to existing threads such as hadoop-2-9-2-spark-2-4-0-access-aws-s3a-bucket, the versions of Hadoop, the AWS-related library, and Spark are critical. I'm using hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar, as the solution mentioned. By putting those jars into $SPARK_HOME/jars, I solved the HIVE_STATS_JDBC_TIMEOUT issue; when I run spark-sql in a shell, everything works fine so far.
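For reference, the copy step is just the following (assuming the two jars were downloaded to the current directory; filenames as above):

```shell
cp hadoop-aws-2.7.3.jar aws-java-sdk-1.7.4.jar "$SPARK_HOME/jars/"
```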
However, I got stuck again when I tried to invoke spark-sql on top of YARN.
The command I used is spark-sql --master=yarn, and the exception thrown is:
org.apache.hadoop.fs.UnsupportedFileSystemException: fs.AbstractFileSystem.s3a.impl=null: No AbstractFileSystem configured for scheme: s3a
at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:160)
at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:249)
To solve that, I tried putting hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar on the Hadoop classpath, e.g. under $HADOOP_HOME/share/hadoop/common/lib.
However, the error remained exactly the same.
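Since --master=yarn executes tasks in YARN containers, the jars must be visible there as well; an alternative to patching the Hadoop classpath is to ship them with the job via --jars (a sketch; the paths are assumptions based on where the jars were placed earlier):

```shell
spark-sql --master=yarn \
  --jars "$SPARK_HOME/jars/hadoop-aws-2.7.3.jar,$SPARK_HOME/jars/aws-java-sdk-1.7.4.jar"
```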
Maybe I need to change the version of Spark? I've racked my brain over this. Any comments are welcome; thanks in advance for your help.