I keep pairs in a mapWithState, with a String as the key and, as the state, an object that contains an array. I update the array whenever a new stream record with the same key arrives. Is there a possibility that the array will be updated twice if the Spark app runs on multiple nodes, or does Spark let only one node at a time update the state? I don't know exactly how the mapWithState execution model works.

Thank you!

Vlad

1 Answer

The StateSpec function is called for each key-value pair, so there can be multiple updates per batch. However, the individual updates are sequential and operate on partitioned data, so there will be no update conflicts, if that is what you're worried about.
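
To make the model concrete, here is a minimal toy sketch in plain Python. It does not use the Spark API and is not how Spark is implemented. It only illustrates the idea that each key is owned by exactly one partition and that updates within a partition are applied one after another. All names (`partition_for`, `update_state`, `run_batch`, `NUM_PARTITIONS`) are hypothetical, and the CRC32-based partitioner is an assumed stand-in for Spark's hash partitioning.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical number of partitions (one task each)


def partition_for(key: str) -> int:
    """Deterministically map a key to a partition (stand-in for a hash partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def update_state(state, value):
    """Hypothetical update function: append the new value to the key's array."""
    if state is None:
        state = []
    state.append(value)
    return state


def run_batch(stores, records):
    """Apply one batch of (key, value) records to the per-partition state stores."""
    # "Shuffle": route every record to the partition that owns its key.
    by_partition = defaultdict(list)
    for key, value in records:
        by_partition[partition_for(key)].append((key, value))

    # Each partition is handled by a single task, which applies its
    # records sequentially, so two updates to one key never race.
    for pid, partition_records in by_partition.items():
        store = stores[pid]
        for key, value in partition_records:
            store[key] = update_state(store.get(key), value)


# One independent state store per partition; nothing is shared between them.
stores = [dict() for _ in range(NUM_PARTITIONS)]

run_batch(stores, [("a", 1), ("b", 2), ("a", 3)])  # two updates for "a" in one batch
run_batch(stores, [("a", 4)])                      # a later batch

print(stores[partition_for("a")]["a"])
```

Key `"a"` receives two updates in the first batch and one in the second, yet its array lives in a single store and grows in arrival order, because only the owning partition ever touches it.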

Graham
zero323
  • Thank you! Yes, I was worried that in a multi-node configuration the state might be updated by two nodes at the same time, which would corrupt the array's data. – Vlad Aug 10 '16 at 07:37
  • See [this SO question](http://stackoverflow.com/questions/36151354/spark-mapwithstate-shuffles-all-data-to-one-node): Spark shuffles the data based on the key. – jifeng.yin Jan 09 '17 at 07:37
  • @zero323 How is the state shared or distributed among the workers? – CᴴᴀZ Feb 22 '17 at 11:02
  • @CᴴᴀZ It is not shared. The state is partitioned by key, in much the same way as in the byKey operations. – zero323 Feb 22 '17 at 12:41