
I've created a model:

val model = pipeline.fit(commentLower)

and I'm attempting to write it to s3:

sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "MYACCESSKEY")
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "MYSECRETKEY")
model.write.overwrite().save("s3n://sparkstore/model")

but I'm getting this error:

Name: java.lang.IllegalArgumentException
Message: Wrong FS: s3n://sparkstore/model, expected: file:///
StackTrace: org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1400)
org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:80)

I also tried with my access key inline:

model.write.overwrite().save("s3n://MYACCESSKEY:MYSECRETKEY@/sparkstore/model")

How do I write a model (or any file for that matter) to s3 from Spark?

Ross Lewis

2 Answers


I don't have an S3 connection to test with, but here is what I think you should use:

val hconf=sc.hadoopConfiguration
hconf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hconf.set("fs.s3.awsAccessKeyId", "MYACCESSKEY")
hconf.set("fs.s3.awsSecretAccessKey", "MYSECRETKEY")

When I do df.write.save("s3://sparkstore/model") I get:

Name: org.apache.hadoop.fs.s3.S3Exception
Message: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/model' - ResponseCode=403, ResponseMessage=Forbidden
StackTrace: org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleServiceException(Jets3tNativeFileSystemStore.java:229)
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:111)

which makes me believe that it did recognize the s3 protocol for the S3 filesystem, but it failed authentication, which is obvious.
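One more thing worth checking: the question writes to an s3n:// path, but sets credentials under the fs.s3.* keys. The native S3 filesystem behind the s3n:// scheme reads the fs.s3n.* keys instead. A minimal untested sketch, assuming a live SparkContext `sc` and the same bucket name as in the question:

```scala
// Configure credentials under the fs.s3n.* keys, matching the s3n:// scheme.
val hconf = sc.hadoopConfiguration
hconf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hconf.set("fs.s3n.awsAccessKeyId", "MYACCESSKEY")
hconf.set("fs.s3n.awsSecretAccessKey", "MYSECRETKEY")

// With the scheme and keys aligned, the save path resolves against S3N,
// not the local filesystem.
model.write.overwrite().save("s3n://sparkstore/model")
```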

Hopefully it fixes your issue.

Thanks, Charles.

charles gomes
  • Thanks for the input! I don't have a problem saving a dataframe, though. The issue is specific to PipelineModels. I'm adding an answer I found elsewhere. – Ross Lewis Sep 19 '16 at 23:09

This isn't exactly what I wanted to do, but I found a similar thread with a similar problem:

How to save models from ML Pipeline to S3 or HDFS?

This is what I ended up doing:

// Serialize the model as a single-element RDD and write it as an object file;
// this path goes through the RDD writer and avoids MLWriter's filesystem check.
sc.parallelize(Seq(model), 1).saveAsObjectFile("swift://RossL.keystone/model")

// Read it back:
val modelx = sc.objectFile[PipelineModel]("swift://RossL.keystone/model").first()
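Since the original question was about S3 rather than Swift, the same object-file workaround should carry over once the fs.s3n.* credentials are configured. An untested sketch, with the bucket name assumed from the question:

```scala
import org.apache.spark.ml.PipelineModel

// Assumes a live SparkContext `sc`, a fitted PipelineModel `model`,
// and fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey already set
// on sc.hadoopConfiguration.
sc.parallelize(Seq(model), 1).saveAsObjectFile("s3n://sparkstore/model")

// Deserialize the model from S3 on the reading side.
val restored = sc.objectFile[PipelineModel]("s3n://sparkstore/model").first()
```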
Ross Lewis