
I currently have some code that measures the overall time taken to run count() on a DataFrame. I have another implementation that measures the time taken to run count() on a sampled version of the same DataFrame:

sampled_df = df.sample(withReplacement=False, fraction=0.1)
sampled_df.count()

I then extrapolate the overall count from the sampled count, but I do not see any decrease in the time taken to compute the sampled count compared to counting the whole dataset: both take around 40 seconds. Is there a reason this happens? Also, is there any memory improvement from using a sampled count over a count on the whole DataFrame?
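For reference, the extrapolation step described above can be sketched as a scale-up by the inverse of the sampling fraction; the sampled count below is a made-up value for illustration:

```python
# Illustrative extrapolation from a sampled count (values are hypothetical).
fraction = 0.1
sampled_count = 1234  # assumed result of sampled_df.count()

# Scale up by the inverse of the sampling fraction to estimate the full count;
# round() avoids off-by-one errors from floating-point division.
estimated_total = round(sampled_count / fraction)
print(estimated_total)  # 12340
```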

Ajayv

1 Answer


You can use countApprox. It lets you choose how long you're willing to wait for an approximate count at a given confidence level.

Sampling still needs to visit every partition to produce a uniform sample, so you aren't really saving any time by counting a sample.

Matt Andruff
  • Are there any memory benefits to doing a count on a sample instead of the whole dataframe? Even if time does not reduce, if it saves some memory, that would be of great help! – Ajayv Aug 09 '22 at 19:00
  • The most expensive operation is visiting every partition, which is why you aren't seeing any performance boost right now. – Matt Andruff Aug 10 '22 at 13:18