
My saga continues -

In short, I'm trying to create a test stack for Spark - the aim being to read a file from one S3 bucket and then write it to another. Windows environment.

I was repeatedly encountering errors when trying to access s3 or s3n, as a ClassNotFoundException was being thrown. The implementation classes were added to core-site.xml as fs.s3.impl and fs.s3n.impl.

I added hadoop/share/tools/lib to the classpath to no avail. I then added the aws-java-sdk and hadoop-aws jars to the share/hadoop/common folder, and I am now able to list the contents of a bucket using Hadoop on the command line.

hadoop fs -ls "s3n://bucket" shows me the contents - this is great news :)

In my mind the Hadoop configuration should be picked up by Spark, so solving one should solve the other; however, when I run spark-shell and try to save a file to S3 I get the usual ClassNotFoundException, shown below.

I'm still quite new to this and unsure if I've missed something obvious, hopefully someone can help me solve the riddle? Any help is greatly appreciated, thanks.
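
For reference, this is roughly what I'm running in spark-shell when it falls over (the bucket and path names below are just placeholders, and credentials are assumed to come from core-site.xml):

    // run inside spark-shell, where sc is the SparkContext the shell provides
    // bucket and path names are placeholders
    val lines = sc.textFile("s3n://source-bucket/input.txt")
    lines.saveAsTextFile("s3n://destination-bucket/output")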

The exception:

    java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)

My core-site.xml (which I believe to be correct now, as Hadoop can access S3):

    <property>
      <name>fs.s3.impl</name>
      <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
    </property>

    <property>
      <name>fs.s3n.impl</name>
      <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
      <description>The FileSystem for s3n: (Native S3) uris.</description>
    </property>

And finally the hadoop-env.cmd showing the classpath (which is seemingly ignored):

    set HADOOP_CONF_DIR=C:\Spark\hadoop\etc\hadoop

    @rem ## added as s3 filesystem not found. http://stackoverflow.com/questions/28029134/how-can-i-access-s3-s3n-from-a-local-hadoop-2-6-installation
    set HADOOP_USER_CLASSPATH_FIRST=true
    set HADOOP_CLASSPATH=%HADOOP_CLASSPATH%:%HADOOP_HOME%\share\hadoop\tools\lib\*

    @rem Extra Java CLASSPATH elements. Automatically insert capacity-scheduler.
    if exist %HADOOP_HOME%\contrib\capacity-scheduler (
      if not defined HADOOP_CLASSPATH (
        set HADOOP_CLASSPATH=%HADOOP_HOME%\contrib\capacity-scheduler\*.jar
      ) else (
        set HADOOP_CLASSPATH=%HADOOP_CLASSPATH%;%HADOOP_HOME%\contrib\capacity-scheduler\*.jar
      )
    )

EDIT: spark-defaults.conf

    spark.driver.extraClassPath=C:\Spark\hadoop\share\hadoop\common\lib\hadoop-aws-2.7.1.jar:C:\Spark\hadoop\share\hadoop\common\lib\aws-java-sdk-1.7.4.jar
    spark.executor.extraClassPath=C:\Spark\hadoop\share\hadoop\common\lib\hadoop-aws-2.7.1.jar:C:\Spark\hadoop\share\hadoop\common\lib\aws-java-sdk-1.7.4.jar
    Have you tried explicitly putting them on the classpath when you submit a spark job? – Yuval Itzchakov Mar 23 '16 at 12:21
  • @YuvalItzchakov I literally, and I mean LITERALLY was here : http://stackoverflow.com/questions/29099115/spark-submit-add-multiple-jars-in-classpath :D I'm going to try it now – null Mar 23 '16 at 12:22
  • I pass them with the `--jars` flag AND add them to the classpath via: `spark.driver.extraClassPath=hadoop-aws-2.7.1.jar:aws-java-sdk-1.10.50.jar`. Same goes for `spark.executor.extraClassPath`. – Yuval Itzchakov Mar 23 '16 at 12:24
  • @YuvalItzchakov at the moment I'm not submitting a spark job per se, I was just trying to save a file from the shell - the information on the spark.driver classpath could be invaluable though - did you also have the same problem? – null Mar 23 '16 at 12:26
  • I had a similar problem, not identical to yours. I'm using Spark-Streaming for stateful computation and I'm checkpointing to S3 on AWS. – Yuval Itzchakov Mar 23 '16 at 12:27
  • @YuvalItzchakov I see, sounds fun. :) I'm now editing the spark-defaults.conf and adding those jars, hopefully it'll work :) – null Mar 23 '16 at 12:30
  • @YuvalItzchakov, would you mind taking a look at the edit for the config file and checking it please? Not sure if accurate or not. – null Mar 23 '16 at 12:34
  • Are you running Spark on Windows? – Yuval Itzchakov Mar 23 '16 at 12:41
  • @YuvalItzchakov, I'm afraid so, yes. – null Mar 23 '16 at 12:46
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/107132/discussion-between-yuval-itzchakov-and-null). – Yuval Itzchakov Mar 23 '16 at 12:47

1 Answer


You need to pass some extra parameters to your spark-shell. Try this flag: `--packages org.apache.hadoop:hadoop-aws:2.7.2`.
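
For example, the launch would look something like the line below (adjust the version to match your local Hadoop build); the dependencies are resolved from Maven and put on both the driver and executor classpaths:

    @rem hadoop-aws pulls in the matching aws-java-sdk as a transitive dependency
    spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.2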

avloss
  • Hi, packages does work, but this will download them each time; instead I'm using --jars, which adds them to the classpath. – null Mar 24 '16 at 08:00
  • This won't download them each time; it'll cache them locally. – avloss Mar 24 '16 at 10:39