pyspark accumulators - understanding their use

Question

I would like to understand what is the use of accumulators. Based upon online examples it seems that we can use to them to count specific issues with the data. For example I have a lot of license numbers, i can count how many of them are not valid using accumulators. But cannot we do the same using filter and map operations? Would it be possible to show a good example where accumulators are used? I would appreciate if you provide sample code in pyspark instead of java or scala

score 1 · Answer 1 · edited May 23 '17 at 12:31

Accumulators are used mostly for diagnostics and retrieving additional data from the actions and typically shouldn't be used as a part of the main logic, especially when called inside transformations*.

Lets start with the first case. You can use accumulator or named accumulator to monitor program execution in close-to-real time (updated per task) and for example kill the job if you encounter to many invalid records. State of the named accumulators can be monitored for example using driver UI.

In case of actions it can used to get additional statistics. For example if you use foreach, foreachPartition to push data to external system you can use accumulators to keep track of failures.

* When are accumulators truly reliable?

pyspark accumulators - understanding their use

1 Answers1