
I have a simple Spark application that I am currently writing that processes billions of records (about 10 TB worth), performs certain transformations on them, and then writes them to a database. One thing that I am struggling with is figuring out how to collect general stats about my application (e.g. errors, number of records processed, etc.). The stats that I am interested in can be represented as simple counters.

The code for my application follows the general pattern below.

    sparkSession.sparkContext.parallelize(data)
        .map(...)
        .map(...)
        .map(...)
        .distinct()
        .map(...)
        .collect()

From my research, it looks like I can use accumulators for this.

https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/Accumulator.html

The idea is that I would create one accumulator for each statistic/counter and pass it to my functions. The functions would then increment the accumulator when something significant happens (e.g. an error occurs), and at the very end (when the driver calls collect()) I could simply log the accumulator values.
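
To make the idea concrete, here is roughly what I have in mind. This is just a sketch: it assumes Spark 2.x's longAccumulator (the 1.6 docs linked above expose sparkContext.accumulator instead), and transform and the counter names are placeholders rather than my real code.

    val recordCount = sparkSession.sparkContext.longAccumulator("recordsProcessed")
    val errorCount  = sparkSession.sparkContext.longAccumulator("errors")

    sparkSession.sparkContext.parallelize(data)
      .map { record =>
        recordCount.add(1)
        try {
          Some(transform(record))    // placeholder for one of my real transformations
        } catch {
          case _: Exception =>
            errorCount.add(1)        // count the failure instead of failing the whole job
            None
        }
      }
      .flatMap(_.toSeq)              // drop the records that failed
      .distinct()
      .collect()

    // read the counters on the driver only after the action (collect) has run
    println(s"processed=${recordCount.value}, errors=${errorCount.value}")

One thing I did notice in the docs is that named accumulators also show up in the Spark web UI for the stage that modifies them, which would be a nice bonus.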

The thing that I am uncertain about is whether this is the right strategy. I have read that accumulators should only be used in "actions", and my transformations are happening in "map", which is not an action.

Any guidance on best practices for collecting stats would be greatly appreciated.

Thanks!

Jon
  • There are/were issues about reliability, but accumulators are the way to go here, IMO, practically speaking. – thebluephantom Jul 18 '20 at 09:01
  • You mean that the reports produced for Spark HistoryServer are not sufficient? These reports are complex JSON files that can be parsed (with some labor) to extract individual stats. And the HistoryServer itself also has a REST API. – Samson Scharfrichter Jul 18 '20 at 11:39
  • +1 to @SamsonScharfrichter. It's either parsing the REST API response, or implementing your own listener (https://stackoverflow.com/questions/24463055/how-to-implement-custom-job-listener-tracker-in-spark) – mazaneicha Jul 18 '20 at 14:34 (a rough sketch of the listener approach follows these comments)
  • Thanks for the input, mazaneicha and SamsonScharfrichter. I was looking at the history server here (https://spark.apache.org/docs/latest/monitoring.html), but I am still unsure how I would implement what I want. The events that I would like are custom and specific to my application, e.g. received error code X in one of my functions. Is there a way to generate custom events? Also, just as an FYI in case it matters, I am using Databricks. – Jon Jul 18 '20 at 19:43
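
For reference, the custom-listener approach mentioned in the comments would look roughly like the sketch below (StatsListener is just a name made up for illustration). As far as I can tell it only surfaces Spark's built-in task and stage metrics, not application-specific events like my error codes, so it would complement rather than replace the counters.

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // hypothetical listener that logs per-task record counts from Spark's own metrics
    class StatsListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val metrics = taskEnd.taskMetrics
        if (metrics != null) {
          println(s"stage=${taskEnd.stageId} recordsRead=${metrics.inputMetrics.recordsRead}")
        }
      }
    }

    // registered once on the driver, e.g. at application startup
    sparkSession.sparkContext.addSparkListener(new StatsListener)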

0 Answers