I have a dataframe with as many as 10 million records. How can I get a count quickly? `df.count` is taking a very long time.

- What is 'a very long time'? Can you tell us more about what and how you're trying to count? – Nathaniel Ford Sep 06 '16 at 21:20
- See http://stackoverflow.com/questions/28413423/count-number-of-rows-in-an-rdd and also the `countApprox` method in Spark if you don't need an exact answer. – Alfredo Gimenez Sep 06 '16 at 21:32
- I am trying like this: `df.count()` – thunderhemu Sep 08 '16 at 04:44
- Possible duplicate of [Count number of rows in an RDD](http://stackoverflow.com/questions/28413423/count-number-of-rows-in-an-rdd) – DNA Apr 24 '17 at 14:38
2 Answers
It's going to take a while no matter what, at least the first time. One approach is to cache the DataFrame, so that you can do more with it afterwards than just count it. For example:
df.cache()
df.count()
Subsequent operations on the cached DataFrame don't take much time.
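A minimal sketch of this pattern, assuming a PySpark session and a synthetic 10-million-row DataFrame in place of the real data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Synthetic stand-in for the real 10M-row DataFrame.
df = spark.range(10_000_000)

df.cache()         # lazily marks the DataFrame for caching
print(df.count())  # first action scans the source and populates the cache (slow)
print(df.count())  # later actions read the cached partitions (fast)
```

Note that `cache()` itself is lazy; nothing is materialized until the first action runs.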

The time it takes to count the records in a DataFrame depends on the power of the cluster and how the data is stored. Performance optimizations can make Spark counts very quick.
It's easier for Spark to perform counts on Parquet files than on CSV/JSON files. Parquet files store row counts in the file footer, so Spark doesn't need to read all the rows and actually perform the count; it can just grab the footer metadata. CSV/JSON files don't have any such metadata.
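As a rough illustration (the paths below are made up), the same count is cheap on Parquet and a full scan on CSV:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths; substitute your own storage locations.
parquet_count = spark.read.parquet("/data/events_parquet").count()                # can use footer metadata
csv_count = spark.read.option("header", "true").csv("/data/events_csv").count()   # must scan every row
```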
If the data is stored in a Postgres database, then the count will be performed by Postgres itself, and its execution time will be a function of the database's performance.
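One way to keep that work on the database side is to hand Spark a subquery that Postgres evaluates itself. This is only a sketch; the connection settings and table name are assumptions, and it requires the PostgreSQL JDBC driver on the Spark classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

count_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/mydb")              # hypothetical connection
    .option("dbtable", "(SELECT count(*) AS n FROM big_table) AS t")   # Postgres runs the count
    .option("user", "spark_user")
    .option("password", "secret")
    .load()
)
count_df.show()  # single row containing the count computed by Postgres
```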
Bigger clusters generally perform count operations faster (unless the data is skewed in a way that causes one node to do all the work, leaving the other nodes idle).
The snappy compression algorithm is generally faster than gzip because it is splittable by Spark and faster to inflate.
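For example (output paths are hypothetical), snappy is Spark's default Parquet codec, but the codec can be set explicitly when writing:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)  # placeholder DataFrame

# Snappy is the default for Parquet in Spark; gzip shown for comparison.
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/events_snappy")
df.write.mode("overwrite").option("compression", "gzip").parquet("/tmp/events_gzip")
```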
The `approx_count_distinct` function, which is powered by HyperLogLog under the hood, will be more performant for distinct counts, at the cost of some precision.
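A sketch with a synthetic column (the data and the `rsd` tolerance here are just for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10_000_000).withColumn("user_id", F.col("id") % 250_000)  # synthetic data

exact = df.select(F.countDistinct("user_id")).first()[0]                     # exact, heavier
approx = df.select(F.approx_count_distinct("user_id", rsd=0.05)).first()[0]  # HyperLogLog estimate
print(exact, approx)
```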
The other answer suggests caching before counting, which will actually slow down the count operation. Caching is an expensive operation that can take a lot more time than counting. Caching is an important performance optimization at times, but not if you just want a simple count.

- Would it be possible to give a scenario where caching of a large df and the associated costs are justified? – Bonson Feb 28 '21 at 11:44
- @Bonson - yes, caching is a powerful pattern that can speed certain types of queries a lot, especially for a DataFrame that'll get reused a lot. Suppose you have a DF, perform a big filtering operation, and then do a bunch of different types of computations on the filtered DF. Caching the filtered DF might help a lot. It can be good to repartition before caching, depending on the data. In short, yes, caching can help, but it can also hurt, so it needs to be applied intelligently. – Powers Feb 28 '21 at 17:56
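A rough sketch of the scenario described in that comment; the source data, the filter, and the column names are all made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source and filter; substitute your own.
events = spark.range(10_000_000).withColumn(
    "status", F.when(F.col("id") % 10 == 0, "error").otherwise("ok")
)
errors = events.filter(F.col("status") == "error")

errors.cache()                            # justified: the filtered DF is reused below
print(errors.count())                     # materializes the cache
print(errors.agg(F.max("id")).first())    # served from the cache
print(errors.agg(F.min("id")).first())    # served from the cache
```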