I have a DataFrame df with a column "freq" that gives, for each row, the probability that the row is selected for output. I am currently sampling with sampleBy:

# Map every distinct freq value to itself, so each stratum is sampled at its own rate.
frac = {
    row.freq: row.freq
    for row in df.select("freq").distinct().collect()
}
result = df.sampleBy("freq", fractions=frac)

This is inspired by this, but it does not seem very clean. Is there a way to do the same thing without creating a dummy dictionary that acts like lambda x: x?

Edit:

Say the dataset is

+-----+----+---- 
| Name|freq| other columns
|Alice| 0.3|
|  Bob| 0.2|
|  Joe| 0.3|
...

I want the final dataframe to contain Alice and Joe's rows each with probability 0.3, Bob's row with probability 0.2 and so on.
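
To make the intended semantics concrete, here is a minimal plain-Python sketch (not Spark code) of the behaviour I am after: each row is kept independently with probability equal to its own freq. The rows list and the sample_rows helper are hypothetical names used only for illustration.

```python
import random

# Hypothetical rows mirroring the example table: (Name, freq).
rows = [("Alice", 0.3), ("Bob", 0.2), ("Joe", 0.3)]


def sample_rows(rows, rng):
    """Keep each row independently with probability equal to its freq."""
    return [row for row in rows if rng.random() < row[1]]


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
trials = 100_000
counts = {name: 0 for name, _ in rows}

# Repeat the sampling many times and count how often each row survives.
for _ in range(trials):
    for name, _freq in sample_rows(rows, rng):
        counts[name] += 1

# Empirical inclusion rates should land close to the freq values.
rates = {name: counts[name] / trials for name in counts}
print(rates)
```

In PySpark, the analogous idea would presumably be a filter that compares a per-row uniform draw (for example from pyspark.sql.functions.rand()) against the freq column. That is an assumption on my part and is not tested here.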
