PySpark sampleBy different fraction per column

Asked Aug 23 '18 at 21:43

Active Aug 24 '18 at 19:19

Viewed 1,097 times

I have a dataframe df that contains a column "freq" which, for each row, gives the probability that the row is selected for output. I am currently sampling using sampleBy:

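# Map every distinct freq value to itself, so each stratum's
# sampling fraction equals its own freq value.
frac = {
    e.freq: e.freq
    for e in df.select("freq").distinct().collect()
}
result = df.sampleBy("freq", fractions=frac)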

This is inspired by this, but it does not seem very clean. Is there a way to do the same thing without creating a dummy dictionary that acts like lambda x: x?

Edit:

Say the dataset is

+-----+----+----
| Name|freq| other columns
+-----+----+----
|Alice| 0.3|
|  Bob| 0.2|
|  Joe| 0.3|
...

I want the final dataframe to contain Alice's and Joe's rows each with probability 0.3, Bob's row with probability 0.2, and so on.
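To make the intended semantics concrete, here is a minimal plain-Python sketch of the rule described above: each row is kept independently with probability equal to its own freq. It does not use Spark; the helper name sample_rows and the list-of-dicts stand-in for the dataframe are hypothetical and only illustrate the behaviour.

```python
import random


def sample_rows(rows, rng):
    """Keep each row independently with probability row['freq']."""
    return [row for row in rows if rng.random() < row["freq"]]


# Toy stand-in for the dataframe in the question (hypothetical data layout).
rows = [
    {"Name": "Alice", "freq": 0.3},
    {"Name": "Bob", "freq": 0.2},
    {"Name": "Joe", "freq": 0.3},
]

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 20_000
kept = {row["Name"]: 0 for row in rows}
for _ in range(trials):
    for row in sample_rows(rows, rng):
        kept[row["Name"]] += 1

# Fraction of trials in which each row was kept; should be close to its freq.
rates = {name: count / trials for name, count in kept.items()}
print(rates)
```

In PySpark the same per-row rule could presumably be expressed as a filter that compares rand() from pyspark.sql.functions against the freq column, which would avoid building the dictionary. That variant is an assumption and is not tested here.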

edited Jun 20 '20 at 09:12

Community

asked Aug 23 '18 at 21:43

stfgrm
