
I have a Spark Streaming job that, when it starts, queries Hive and builds a Map[Int, String] object, which is then used in parts of the calculations the job performs.

The problem I have is that the data in Hive can change every 2 hours. I would like to be able to refresh this static data on a schedule, without having to restart the Spark job every time.

The initial load of the Map object takes around 1 minute.

Any help is very welcome.

CatchingMonkey
  • I think you can find your answer here: https://stackoverflow.com/questions/33372264/how-can-i-update-a-broadcast-variable-in-spark-streaming – Jiayi Liao Feb 21 '19 at 04:32

1 Answer


You can use a SparkListener, which will be triggered every time a job starts for any stream within the Spark context. Since your Hive data is only updated every two hours, there is no harm in refreshing the map on every trigger, AFAIK.

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

sc.addSparkListener(new SparkListener() {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // reload the map that will be sent to the executors;
    // this callback runs at the start of every job
  }
})
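For completeness, here is a rough sketch of what that reload could look like end to end, assuming the map is shared via a broadcast variable that gets rebroadcast on refresh. The RefreshableMap wrapper, the loadMapFromHive helper, the my_db.my_table table and key/value column names, and the two-hour time gate are all my assumptions, not part of the original answer:

import java.util.concurrent.atomic.{AtomicLong, AtomicReference}

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
import org.apache.spark.sql.SparkSession

object RefreshableMap {
  // Matches the two-hour cadence at which the Hive data changes
  private val refreshIntervalMs = 2 * 60 * 60 * 1000L
  private val lastRefresh = new AtomicLong(0L)
  private val current = new AtomicReference[Broadcast[Map[Int, String]]]()

  // Hypothetical loader: table and column names are placeholders
  private def loadMapFromHive(spark: SparkSession): Map[Int, String] =
    spark.sql("SELECT key, value FROM my_db.my_table")
      .collect()
      .map(row => row.getInt(0) -> row.getString(1))
      .toMap

  def install(spark: SparkSession): Unit = {
    refresh(spark) // initial ~1 minute load before streaming starts
    spark.sparkContext.addSparkListener(new SparkListener() {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
        // Only reload when the data may actually have changed
        if (System.currentTimeMillis() - lastRefresh.get() >= refreshIntervalMs) {
          refresh(spark)
        }
      }
    })
  }

  private def refresh(spark: SparkSession): Unit = {
    // Stamp the time first so the job triggered by collect() below
    // doesn't re-enter this method via its own onJobStart event
    lastRefresh.set(System.currentTimeMillis())
    val old = current.get()
    current.set(spark.sparkContext.broadcast(loadMapFromHive(spark)))
    if (old != null) old.unpersist() // release the stale copy on the executors
  }

  // Call this on the driver (e.g. inside transform/foreachRDD) each batch
  // and capture the returned Broadcast in your closures
  def get: Broadcast[Map[Int, String]] = current.get()
}

Rebroadcasting a fresh variable and reading it through RefreshableMap.get sidesteps the fact that a Broadcast value itself is immutable, which is essentially the approach taken in the question Jiayi Liao linked above.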
Asiri Liyana Arachchi