I have a dataframe df with a column "freq" that gives, for each row, the probability that the row is selected for output. I am currently sampling using sampleBy:
# Map every distinct freq value to itself, so each stratum is sampled at its own rate
frac = {
    e.freq: e.freq
    for e in df.select("freq").distinct().collect()
}
result = df.sampleBy("freq", fractions=frac)
This is inspired by this, but it does not seem very clean. Is there a way to do the same thing without creating a dummy dictionary that acts like lambda x: x?
Edit:
Say the dataset is:
+-----+----+---------------
| Name|freq| other columns
+-----+----+---------------
|Alice| 0.3|
|  Bob| 0.2|
|  Joe| 0.3|
...
I want the final dataframe to contain Alice's and Joe's rows each with probability 0.3, Bob's row with probability 0.2, and so on.
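To make the intended semantics concrete, here is a minimal plain-Python sketch (not Spark code) of per-row sampling: each row is kept when a uniform random draw falls below its own freq. It assumes rows are selected independently of each other, which the description above does not state explicitly. The helper name sample_rows, the list-of-dicts representation, and the trial count are all made up for illustration.

```python
import random

def sample_rows(rows, rng):
    """Keep each row independently with probability row["freq"] (hypothetical helper)."""
    # rng.random() is uniform on [0, 1), so the comparison below is
    # true with probability equal to the row's freq.
    return [row for row in rows if rng.random() < row["freq"]]

# Toy stand-in for the dataframe in the example above.
rows = [
    {"Name": "Alice", "freq": 0.3},
    {"Name": "Bob", "freq": 0.2},
    {"Name": "Joe", "freq": 0.3},
]

rng = random.Random(0)  # fixed seed only so repeated runs match
trials = 20000          # arbitrary number of repetitions for the check
kept = {row["Name"]: 0 for row in rows}
for _ in range(trials):
    for row in sample_rows(rows, rng):
        kept[row["Name"]] += 1

# The empirical keep rate of each row should be close to its freq.
for name, count in kept.items():
    print(name, round(count / trials, 3))
```

The point of the sketch is that no lookup table from freq to freq is needed: the comparison against the row's own value already plays the role of the identity mapping. In PySpark the analogous idea would presumably be a filter comparing a per-row random column against the freq column, but that variant is not shown or tested here.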