I have two PySpark dataframes, `tdf` and `fdf`, where `fdf` is much larger than `tdf`. The sizes of both dataframes change daily, and I don't know them in advance. I want to randomly pick rows from `fdf` to compose a new dataframe `rdf`, whose size is approximately equal to the size of `tdf`. Currently I have these lines:
tdf_count = tdf.count()
fdf_count = fdf.count()
sampling_fraction = float(tdf_count) / float(fdf_count)
rdf = fdf.sample(fraction=sampling_fraction, seed=SEED)
These lines produce the correct result, but as `fdf` grows, `fdf.count()` takes a few days to finish. Can you suggest a faster approach in PySpark?