I think it is to do with lazy evaluation. I remember an old thread on the Spark developer community saying that, because of the optimisations applied to the DataFrame and Dataset APIs, a count may not necessarily trigger evaluation of the entire DataFrame/Dataset, so the count may not be accurate.
However, if you do a df.rdd.count or ds.rdd.count, or do a cache or persist on the DataFrame/Dataset first and then count, it will evaluate the entire DataFrame or Dataset and the count will be accurate.
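A minimal sketch of both approaches, assuming you already have a SparkSession and a DataFrame called df (names are just placeholders):

```scala
// Option 1: count on the underlying RDD, bypassing the DataFrame-level
// count optimisation and forcing evaluation of every partition.
val rddCount = df.rdd.count()

// Option 2: cache (or persist) first, then count; the count action
// materialises the whole DataFrame into the cache.
df.cache()
val cachedCount = df.count()
```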
Looking at another thread, How to force DataFrame evaluation in Spark, please see the reply by Vince.Bdn, which is in line with my chain of thought.
If you want to validate this further, create a large DataFrame, do one count before a persist and another after the persist, and compare the DAGs of the two jobs; that should confirm it. In my case I went with a DataFrame of 1 million records with 6 columns.
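Something along these lines should reproduce the experiment; the column expressions are arbitrary, any six columns will do:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("count-before-and-after-persist")
  .master("local[*]")
  .getOrCreate()

// 1 million rows, 6 columns of derived values.
val df = spark.range(1000000L).toDF("id")
  .withColumn("c1", col("id") * 2)
  .withColumn("c2", col("id") % 7)
  .withColumn("c3", rand())
  .withColumn("c4", concat(lit("row_"), col("id")))
  .withColumn("c5", col("id") + 42)

df.count()      // count before persisting

df.persist()
df.count()      // count after persisting; materialises the whole DataFrame

// Now compare the DAGs of the two count jobs in the Spark UI
// (http://localhost:4040 by default for a local run).
```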