
I am building a lambda architecture and need Spark, as the batch part of it, to restart itself either at regular intervals or right after finishing, or to have the restart triggered by a Spark Streaming job. I've looked around and probably don't understand the SparkContext well enough, but I'm not sure whether I can just put the Spark work in a loop. Could anyone give me some quick guidance? Another quick question: given that data will be continually added to HBase, which Spark will be reading from, is there any use for caching? Thanks in advance for the help.

Edit: Would all computations be redone if I implement a SparkListener and call collect() when the job ends?

SpooXter

2 Answers


When you call awaitTermination(), the StreamingContext will not exit and will keep running. You need to call stop() from another thread to stop the streaming context.

    JavaDStream<T> jsonStream = streamingContext.receiverStream(receiver);
    streamingContext.start();
    streamingContext.awaitTermination();

The receiver will be invoked once per batch interval.
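
For reference, here is a minimal sketch of that pattern, stopping the context from a second thread after a fixed delay. The socket source, port, and 60-second delay are only illustrative assumptions, not part of the setup above:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class StopStreamingFromAnotherThread {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("stop-example").setMaster("local[2]");
            JavaStreamingContext streamingContext =
                    new JavaStreamingContext(conf, Durations.seconds(10));

            // Any input source works here; a socket stream stands in for the custom receiver.
            JavaDStream<String> lines = streamingContext.socketTextStream("localhost", 9999);
            lines.print(); // at least one output operation is required before start()

            // Stop the streaming context from a separate thread after a fixed delay.
            new Thread(() -> {
                try {
                    Thread.sleep(60_000);
                    streamingContext.stop(true, true); // also stop the SparkContext, gracefully
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();

            streamingContext.start();
            streamingContext.awaitTermination(); // returns once stop() has been called
        }
    }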

anupsth
  • Thanks for the input. The interesting thing is that I'm actually trying to have the batch job do this, i.e. a plain SparkContext, not a streaming one. My Spark Streaming job takes care of the speed layer of the lambda architecture, and I have a separate Spark job for the batch layer. I hope that makes it clearer. Maybe I'm misunderstanding how the lambda architecture is meant to be used. – SpooXter Mar 18 '16 at 15:45

It turns out it was easier than I thought. I suspected while loops wouldn't work outside RDD functions because of Spark's lazy execution. I was wrong. This example hinted that it is possible: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaPageRank.java
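
For anyone wondering what that looks like, here is a rough sketch of the driver-side loop I mean. The text-file source, count() action, and 15-minute sleep are placeholders only, since the real batch job reads from HBase:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class BatchLayerLoop {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("batch-layer").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            while (true) {
                // Re-read the source each iteration so newly arrived data is included.
                JavaRDD<String> input = sc.textFile("hdfs:///data/events");

                // The actual batch-view computation goes here; count() is just a stand-in action.
                long count = input.count();
                System.out.println("Batch view rebuilt, record count = " + count);

                // Wait before recomputing (the interval is arbitrary).
                Thread.sleep(15 * 60 * 1000L);
            }
        }
    }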

SpooXter