
I have two PySpark dataframes, tdf and fdf, where fdf is much larger than tdf. The sizes of these dataframes change daily, and I don't know them in advance. I want to randomly pick rows from fdf to compose a new dataframe rdf, whose size is approximately equal to the size of tdf. Currently I have these lines:

tdf_count = tdf.count()
fdf_count = fdf.count()  # this full count is the slow step
sampling_fraction = float(tdf_count) / float(fdf_count)
rdf = fdf.sample(sampling_fraction, SEED)

These lines produce the correct result, but as fdf grows, fdf.count() takes a few days to finish. Can you suggest a faster approach in PySpark?


1 Answer


You can try sampling from the dataframe to get an estimated count:

ratio = 0.01
fdf_estimate = fdf.sample(fraction=ratio).count() / ratio

You can adjust the ratio to trade the accuracy of the estimate against runtime.
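
Putting the pieces together, here is a minimal sketch of the whole flow under the assumptions of the question (tdf and fdf already exist as DataFrames; the seed value and the 1% estimation ratio are illustrative, not prescribed):

# tdf and fdf are assumed to be the DataFrames from the question
SEED = 42      # illustrative seed, replace with your own
ratio = 0.01   # fraction used only to estimate the size of fdf

# tdf is the smaller dataframe, so an exact count stays cheap
tdf_count = tdf.count()

# estimate the size of fdf from a small sample instead of a full count
fdf_estimate = fdf.sample(fraction=ratio, seed=SEED).count() / ratio

# sample fdf so that rdf ends up with roughly as many rows as tdf
sampling_fraction = min(1.0, tdf_count / fdf_estimate)
rdf = fdf.sample(fraction=sampling_fraction, seed=SEED)

Note that DataFrame.sample is approximate by design (it returns roughly fraction * count rows rather than an exact number), which matches the "approximately equal" requirement in the question; clamping the fraction to 1.0 just guards against an estimate that undershoots the true size of fdf.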
