I am currently building a Spark application and I'd like to log some stats from my intermediary RDDs. All I need is the size of the RDDs at different steps of the transformations.
The transformations are linear, therefore I don't need to cache anything, except to call .count()
(which doesn't need to be accurate nor synchronous).
I could use countAsync
to become asynchronous:
val rawData = sc.textFile(path)
val (values, malformedRows) = parse(rawData)
values.countAsync.onComplete(size => logger.info(s"${malformedRows.count} malformed rows while parsing ${path} (${size} successfully parsed)"))
val result = values
.map(operator)
.filter(predicate)
result.countAsync.onComplete(size => logger.info(s"${size} final rows"))
result.saveAsTextFile(output)
However, my understanding is that the countAsync
will trigger a new action (as count
does). And since my RDDs will eventually be computed (by the last saveAsTextFile
action), I'm wondering if it is possible to capitalize on the last action in order to log the stats without creating new actions.
Is there any way to have an approximate asynchronous count without triggering a new action? In other word to have a lazy count()
?