I currently have some code that computes the overall time taken to run the count operation on a dataframe. I have another implementation which measures the time taken to run count on a sampled version of this dataframe.
sampled_df = df.sample(withReplacement=False, fraction=0.1)
sampled_df.count()
I then extrapolate the overall count from the sampled count. But I do not see an overall decrease in the time taken for calculating this sampled count when compared to doing a count on the whole dataset. Both seem to take around 40 seconds. Is there a reason this happens? Also, is there an improvement in terms of memory when using a sampled count over count on whole dataframe?