
I currently have some code that measures the overall time taken to run count() on a DataFrame. I have another implementation that measures the time taken to run count() on a sampled version of the same DataFrame:

sampled_df = df.sample(withReplacement=False, fraction=0.1)
sampled_df.count()

I then extrapolate the overall count from the sampled count, but I do not see any decrease in the time taken to compute the sampled count compared to counting the whole dataset: both take around 40 seconds. Is there a reason this happens? Also, is there any memory improvement from using a sampled count over a count on the whole DataFrame?
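For reference, the extrapolation step described above can be sketched as a scale-up by the inverse of the sampling fraction; the sampled count below is a made-up value for illustration:

```python
# Illustrative extrapolation from a sampled count (values are hypothetical).
fraction = 0.1
sampled_count = 1234  # assumed result of sampled_df.count()

# Scale up by the inverse of the sampling fraction to estimate the full count;
# round() avoids off-by-one errors from floating-point division.
estimated_total = round(sampled_count / fraction)
print(estimated_total)  # 12340
```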

Ajayv

1 Answer


You can use countApprox. It lets you choose how long you're willing to wait for an approximate count at a given confidence level.

Sampling still needs to visit every partition to produce a uniform sample, so you aren't really saving any time by counting a sample.

Matt Andruff
  • Are there any memory benefits to doing a count on a sample instead of the whole dataframe? Even if time does not reduce, if it saves some memory, that would be of great help! – Ajayv Aug 09 '22 at 19:00
  • The most expensive operation is visiting every partition, which is why you aren't seeing any performance boost right now. – Matt Andruff Aug 10 '22 at 13:18