
I have two PySpark dataframes, tdf and fdf, where fdf is much larger than tdf. The sizes of these dataframes change daily, and I don't know them in advance. I want to randomly pick rows from fdf to compose a new dataframe rdf, whose size is approximately equal to the size of tdf. Currently I have these lines:

tdf_count = tdf.count()
fdf_count = fdf.count()  # this full count is the slow step
sampling_fraction = float(tdf_count) / float(fdf_count)
rdf = fdf.sample(sampling_fraction, SEED)

These lines produce the correct result, but as fdf grows, fdf.count() takes a few days to finish. Can you suggest a faster approach in PySpark?


1 Answer


You can try sampling from the dataframe to get an estimated count:

ratio = 0.01
fdf_estimate = fdf.sample(fraction=ratio).count() / ratio

You can adjust the ratio to trade the accuracy of the estimate against runtime.
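
Putting the pieces together, here is a minimal sketch of the whole flow under the assumptions of the question (tdf and fdf already exist as DataFrames; the seed value and the 1% estimation ratio are illustrative, not prescribed):

# tdf and fdf are assumed to be the DataFrames from the question
SEED = 42      # illustrative seed, replace with your own
ratio = 0.01   # fraction used only to estimate the size of fdf

# tdf is the smaller dataframe, so an exact count stays cheap
tdf_count = tdf.count()

# estimate the size of fdf from a small sample instead of a full count
fdf_estimate = fdf.sample(fraction=ratio, seed=SEED).count() / ratio

# sample fdf so that rdf ends up with roughly as many rows as tdf
sampling_fraction = min(1.0, tdf_count / fdf_estimate)
rdf = fdf.sample(fraction=sampling_fraction, seed=SEED)

Note that DataFrame.sample is approximate by design (it returns roughly fraction * count rows rather than an exact number), which matches the "approximately equal" requirement in the question; clamping the fraction to 1.0 just guards against an estimate that undershoots the true size of fdf.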
