I am using a Spark accumulator to collect statistics for each pipeline.
In a typical pipeline I would read a DataFrame:
df = spark.read.format("csv").option("header", "true").load("/mnt/prepared/orders")
df.count()  # returns 7 rows
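
For context, the row statistics are collected roughly like this (a simplified sketch of my setup; count_rows and the map step stand in for my actual statistics code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rows_read = spark.sparkContext.accumulator(0)

def count_rows(row):
    # runs once per row, in every job that executes this stage
    rows_read.add(1)
    return row

# the counting map becomes part of the DataFrame's lineage
df = df.rdd.map(count_rows).toDF(df.schema)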
Then I would write it to two different locations:
df.write.format("delta").option("header", "true").save("/mnt/prepared/orders")
df.write.format("delta").option("header", "true").save("/mnt/reporting/orders_current/")
Unfortunately, my accumulator statistics get updated on each write operation: it reports 14 rows read, while I have only read the input DataFrame once. I suspect each write launches its own job that re-executes the whole lineage, including the counting step, which would explain the doubled figure.
How can I make my accumulator properly reflect the number of rows I actually read?
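
Would persisting the DataFrame before the two writes be the right approach, so the counting step only runs once? Something along these lines (a sketch of what I am considering, assuming the data fits in cache):

df.cache()
df.count()  # materializes the cache; the accumulator should now show 7

# the writes should now read from the cache instead of
# re-executing the lineage (and the counting step) again
df.write.format("delta").option("header", "true").save("/mnt/prepared/orders")
df.write.format("delta").option("header", "true").save("/mnt/reporting/orders_current/")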
I am new to Spark. I have checked several threads around this issue but did not find my answer:
- Statistical accumulator in Python
- spark Accumulator reset
- When are accumulators truly reliable?