I am writing a simple Spark application that processes billions of records (about 10 TB worth), performs certain transformations on them, and then writes them to a database. One thing I am struggling with is how to collect general stats about the run (e.g. errors, number of records processed, etc.). The stats I am interested in can all be represented as simple counters.
The code for my application follows the general pattern below.
sparkSession.sparkContext.parallelize(data)
  .map(...)
  .map(...)
  .map(...)
  .distinct()
  .map(...)
  .collect()
From my research, it looks like I can use accumulators for this.
https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/Accumulator.html
The idea is that I would create one accumulator per statistic/counter and pass it into my functions. The functions would then increment the appropriate accumulator whenever something significant happens (e.g. an error occurs), and at the very end (i.e. once the driver's call to collect() returns) I could simply log the accumulator values.
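For concreteness, this is roughly the pattern I have in mind (a sketch only, assuming Spark 2.x's sparkContext.longAccumulator rather than the older Accumulator class from the 1.6 docs I linked; transformA, transformB, isInvalid, and log are placeholders for my real transformations and logging):

val recordsProcessed = sparkSession.sparkContext.longAccumulator("recordsProcessed")
val errorCount = sparkSession.sparkContext.longAccumulator("errorCount")

sparkSession.sparkContext.parallelize(data)
  .map { record =>
    recordsProcessed.add(1)        // count every record seen
    transformA(record)             // placeholder transformation
  }
  .map { record =>
    if (isInvalid(record)) errorCount.add(1)   // placeholder error check
    transformB(record)
  }
  .distinct()
  .collect()

// read the counters on the driver, after collect() has completed
log.info(s"processed=${recordsProcessed.value} errors=${errorCount.value}")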
What I am uncertain about is whether this is the right strategy. I have read that accumulator updates are only guaranteed to be applied exactly once when they happen inside actions, whereas my increments would happen inside map, which is a transformation, not an action, so the counts could be off if tasks are retried or stages are recomputed.
Any guidance on best practices for collecting stats like these would be greatly appreciated.
Thanks!