
I am trying to checkpoint an RDD to a non-HDFS system. From the DSE documentation it seems that it is not possible to use the Cassandra File System, so I am planning to use Amazon S3. But I am not able to find a good example of using AWS for this.

Questions

  • How do I use Amazon S3 as the checkpoint directory? Is it enough to call ssc.checkpoint(amazons3url)?
  • Is it possible to use any reliable data storage other than the Hadoop file system for checkpointing?
Knight71

2 Answers


From the answer in the linked question:

Solution 1:

# In the shell, before launching the job:
export AWS_ACCESS_KEY_ID=<your access>
export AWS_SECRET_ACCESS_KEY=<your secret>

// In the Spark application:
ssc.checkpoint(checkpointDirectory)

Set the checkpoint directory to an S3 URL, for example s3n://spark-streaming/checkpoint.

Then launch your Spark application with spark-submit. This works in Spark 1.4.2.
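As a minimal end-to-end sketch of Solution 1 (the app name, batch interval, and socket source are assumptions for illustration; Spark copies the exported AWS variables into its Hadoop configuration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("S3CheckpointExample")
val ssc = new StreamingContext(conf, Seconds(10))

// Credentials come from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY,
// exported in the environment before spark-submit.
ssc.checkpoint("s3n://spark-streaming/checkpoint")

val lines = ssc.socketTextStream("localhost", 9999) // assumed example source
lines.print()

ssc.start()
ssc.awaitTermination()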

Solution 2:

  import org.apache.hadoop.conf.Configuration
  import org.apache.spark.streaming.StreamingContext

  // Point the s3 scheme at the native S3 filesystem and supply credentials.
  val hadoopConf: Configuration = new Configuration()
  hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
  hadoopConf.set("fs.s3n.awsAccessKeyId", "id-1")
  hadoopConf.set("fs.s3n.awsSecretAccessKey", "secret-key")

  // Restore the context from the checkpoint if one exists,
  // otherwise create it with the factory function.
  val ssc = StreamingContext.getOrCreate(checkPointDir, () => {
    createStreamingContext(checkPointDir, config)
  }, hadoopConf)
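createStreamingContext above is left undefined in the answer; a minimal sketch of such a factory, assuming a SparkConf-typed config and a 10-second batch interval, could be:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Hypothetical factory: invoked only when no checkpoint exists yet at
  // checkPointDir; on restart the context is rebuilt from the checkpoint instead.
  def createStreamingContext(checkPointDir: String, config: SparkConf): StreamingContext = {
    val ssc = new StreamingContext(config, Seconds(10))
    ssc.checkpoint(checkPointDir) // must match the directory passed to getOrCreate
    // ... define DStreams and output operations here ...
    ssc
  }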
Knight71

To checkpoint to S3, you can pass the following notation to the StreamingContext method def checkpoint(directory: String): Unit

s3n://<aws-access-key>:<aws-secret-key>@<s3-bucket>/<prefix ...>
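For example, plugged into a streaming context (placeholders kept as above; note that a secret key containing / would need to be URL-encoded for this notation to parse):

// Credentials ride along in the checkpoint URL itself.
ssc.checkpoint("s3n://<aws-access-key>:<aws-secret-key>@<s3-bucket>/<prefix>")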

Another reliable file system for checkpointing that is not listed in the Spark documentation is Tachyon (since renamed Alluxio).

Jeremy Sanecki