I would like to understand what is the use of accumulators. Based upon online examples it seems that we can use to them to count specific issues with the data. For example I have a lot of license numbers, i can count how many of them are not valid using accumulators. But cannot we do the same using filter and map operations? Would it be possible to show a good example where accumulators are used? I would appreciate if you provide sample code in pyspark instead of java or scala
Asked
Active
Viewed 1,582 times
1 Answers
1
Accumulators are used mostly for diagnostics and retrieving additional data from the actions and typically shouldn't be used as a part of the main logic, especially when called inside transformations*.
Lets start with the first case. You can use accumulator
or named accumulator
to monitor program execution in close-to-real time (updated per task) and for example kill the job if you encounter to many invalid records. State of the named accumulators can be monitored for example using driver UI.
In case of actions it can used to get additional statistics. For example if you use foreach
, foreachPartition
to push data to external system you can use accumulators to keep track of failures.