
I am writing a Spark Streaming app that compares online streaming data against base data, which I broadcast to each compute node. However, since the base data is updated daily, I need to update the broadcast variable daily too. The base data resides on HDFS.

Is there a way to do this? The update is not triggered by any streaming results; it simply happens at 12:00 am every day. Moreover, if there is such a way, will the update process block the Spark Streaming computing jobs?

Soundsoul
  • I have read http://stackoverflow.com/questions/33372264/how-can-i-update-a-broadcast-variable-in-spark-streaming. The answer suggests something good, but I am still confused about when to call the update process. – Soundsoul Feb 15 '16 at 07:03

1 Answer


Refer to the last answer in the thread you linked. Summary: instead of broadcasting the data itself, broadcast the caching code that refreshes the data at the needed interval:

  1. Create a CacheLookup object that refreshes its data daily at 12 am
  2. Wrap it in a broadcast variable
  3. Use the CacheLookup as part of the streaming logic
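The steps above could be sketched roughly as follows. Note this is a minimal illustration, not Spark API: the class name `CacheLookup`, the loader function, and the TTL-based refresh (mentioned in the comments below) are all assumptions about how one might implement the pattern.

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of the CacheLookup pattern: the class and the loader
// are illustrative names, not part of the Spark API. In a real job the
// loader would re-read the base data from HDFS, and the object would be
// wrapped in a broadcast variable, e.g.:
//   Broadcast<CacheLookup> bc = sc.broadcast(new CacheLookup(ttl, loader));
class CacheLookup implements java.io.Serializable {
    private final long ttlMillis;                        // time to live, e.g. 24h
    private final Supplier<Map<String, String>> loader;  // re-reads the base data

    // The cached copy lives per executor; transient so it is rebuilt
    // locally after the object is shipped to the workers.
    private transient volatile Map<String, String> data;
    private transient volatile long loadedAt;

    CacheLookup(long ttlMillis, Supplier<Map<String, String>> loader) {
        this.ttlMillis = ttlMillis;
        this.loader = loader;
    }

    private synchronized Map<String, String> refreshed() {
        long now = System.currentTimeMillis();
        if (data == null || now - loadedAt > ttlMillis) {  // expired: reload
            data = loader.get();
            loadedAt = now;
        }
        return data;
    }

    // Streaming code calls lookup(id) per record. Only the first call after
    // expiry pays the reload cost, so normal batches are not blocked.
    String lookup(String id) {
        return refreshed().get(id);
    }
}
```

The key point is that the broadcast variable itself never changes; what it carries is the logic to fetch fresh data, so no re-broadcast is needed and only the batch that happens to hit the expiry pays the reload latency.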
Ravi Reddy
  • By item 3 (use it as part of the streaming logic), do you mean checking the CacheLookup object in each processing batch? And will this check only take effect daily at 12 am, due to a change in the update-control flag? – Soundsoul Mar 02 '16 at 23:17
  • For any reference-data lookup, use "CacheLookup.lookup(id)". In your CacheLookup, maintain the last read time for the cached data; if the data does not exist, read it from the DB. Set the cache TTL (time to live) to one day so that it gets refreshed at that frequency. – Ravi Reddy Mar 07 '16 at 20:11