
I have a Spark Streaming job that, when it starts, queries Hive and builds a Map[Int, String] object, which is then used in parts of the calculations the job performs.

The problem I have is that the data in Hive can change every 2 hours. I would like to be able to refresh this static data on a schedule, without having to restart the Spark job every time.

The initial load of the Map object takes around 1 minute.

Any help is very welcome.

CatchingMonkey
  • I think you can find your answer here: https://stackoverflow.com/questions/33372264/how-can-i-update-a-broadcast-variable-in-spark-streaming – Jiayi Liao Feb 21 '19 at 04:32

1 Answer


You can use a SparkListener, which will be triggered every time a job starts for any stream within the Spark context. Since your Hive data is only updated every two hours, there is no harm in refreshing the map on every trigger, AFAIK.

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

sc.addSparkListener(new SparkListener() {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // reload the map that will be sent to the executors;
    // this callback runs at the start of every job
  }
})
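For completeness, here is a rough sketch of what that reload could look like end to end, assuming the map is shared via a broadcast variable that gets rebroadcast on refresh. The RefreshableMap wrapper, the loadMapFromHive helper, the my_db.my_table table and key/value column names, and the two-hour time gate are all my assumptions, not part of the original answer:

import java.util.concurrent.atomic.{AtomicLong, AtomicReference}

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
import org.apache.spark.sql.SparkSession

object RefreshableMap {
  // Matches the two-hour cadence at which the Hive data changes
  private val refreshIntervalMs = 2 * 60 * 60 * 1000L
  private val lastRefresh = new AtomicLong(0L)
  private val current = new AtomicReference[Broadcast[Map[Int, String]]]()

  // Hypothetical loader: table and column names are placeholders
  private def loadMapFromHive(spark: SparkSession): Map[Int, String] =
    spark.sql("SELECT key, value FROM my_db.my_table")
      .collect()
      .map(row => row.getInt(0) -> row.getString(1))
      .toMap

  def install(spark: SparkSession): Unit = {
    refresh(spark) // initial ~1 minute load before streaming starts
    spark.sparkContext.addSparkListener(new SparkListener() {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
        // Only reload when the data may actually have changed
        if (System.currentTimeMillis() - lastRefresh.get() >= refreshIntervalMs) {
          refresh(spark)
        }
      }
    })
  }

  private def refresh(spark: SparkSession): Unit = {
    // Stamp the time first so the job triggered by collect() below
    // doesn't re-enter this method via its own onJobStart event
    lastRefresh.set(System.currentTimeMillis())
    val old = current.get()
    current.set(spark.sparkContext.broadcast(loadMapFromHive(spark)))
    if (old != null) old.unpersist() // release the stale copy on the executors
  }

  // Call this on the driver (e.g. inside transform/foreachRDD) each batch
  // and capture the returned Broadcast in your closures
  def get: Broadcast[Map[Int, String]] = current.get()
}

Rebroadcasting a fresh variable and reading it through RefreshableMap.get sidesteps the fact that a Broadcast value itself is immutable, which is essentially the approach taken in the question Jiayi Liao linked above.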
Asiri Liyana Arachchi