2

I am currently building a Spark application and I'd like to log some stats from my intermediary RDDs. All I need is the size of the RDDs at different steps of the transformations.

The transformations are linear, therefore I don't need to cache anything, except to call .count() (which doesn't need to be accurate nor synchronous).

I could use countAsync to become asynchronous:

val rawData = sc.textFile(path)
val (values, malformedRows) = parse(rawData)
values.countAsync.onComplete(size => logger.info(s"${malformedRows.count} malformed rows while parsing ${path} (${size} successfully parsed)"))
val result = values
    .map(operator)
    .filter(predicate)
result.countAsync.onComplete(size => logger.info(s"${size} final rows"))
result.saveAsTextFile(output)

However, my understanding is that the countAsync will trigger a new action (as count does). And since my RDDs will eventually be computed (by the last saveAsTextFile action), I'm wondering if it is possible to capitalize on the last action in order to log the stats without creating new actions.

Is there any way to have an approximate asynchronous count without triggering a new action? In other word to have a lazy count()?

zero323
  • 322,348
  • 103
  • 959
  • 935
Simon
  • 31
  • 1
  • 5
  • 2
    Possible duplicate of [Spark: how to get the number of written rows?](http://stackoverflow.com/q/37496650/1560062) – zero323 Jun 05 '16 at 18:20
  • Thanks for the link. I'm looking for a way to count temporary RDDs as well, not only the one that is written into a file. But it seems that using an accumulator may solve this problem. – Simon Jun 06 '16 at 19:36
  • You're welcome. Before using accumulators be sure to understand their behavior in transformations. You'll find some basic info in the docs. – zero323 Jun 06 '16 at 20:11
  • I will. My issue with accumulators is that if I want to count in a submethod that does one filter, I need to put my log statement after writing the file. Thus I have code split into the local method that counts and the end of my job that logs I could probably design something that will be kind of generic, but it seems more complex than it should. – Simon Jun 07 '16 at 20:40

0 Answers0