Let's say I have a Spark DataFrame with the following schema:
root
 |-- prob: Double
 |-- word: String
I'd like to randomly select two different words from this DataFrame, and I'd like to repeat this X times, so that at the end I have X tuples of randomly selected words. Every selection is, of course, independent of the others. How do I accomplish this?
EXAMPLE:
Let's say this is my data-set:
[(0.1,"blue"),(0.2,"yellow"),(0.1,"red"),(0.6,"green")]
where the first number is prob and the second is the word. For X=5, the output could be:
1. blue, green
2. green, yellow
3. green, yellow
4. yellow, blue
5. green, red
Since the selections are independent, tuples 2 and 3 can be identical, and that's fine. Within a single tuple, however, a word can appear only once.
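
To make the intended behaviour concrete, here is a minimal plain-Python sketch of the sampling I have in mind, run on the small example above. It is not Spark code and not a proposed solution for a large DataFrame. It assumes that prob is meant to be used as a sampling weight and that the words are unique; the function name sample_pairs and the seed parameter are made up for illustration.

```python
import random

def sample_pairs(rows, x, seed=None):
    """Draw x independent pairs of distinct words.

    rows is a list of (prob, word) tuples. Assumption: prob is used as a
    sampling weight and every word appears in exactly one row.
    """
    rng = random.Random(seed)
    words = [w for _, w in rows]
    weights = [p for p, _ in rows]
    pairs = []
    for _ in range(x):
        # First word: weighted draw over all rows.
        first = rng.choices(words, weights=weights, k=1)[0]
        # Second word: weighted draw over the remaining rows only,
        # so it can never equal the first word.
        rest = [(p, w) for p, w in rows if w != first]
        second = rng.choices(
            [w for _, w in rest],
            weights=[p for p, _ in rest],
            k=1,
        )[0]
        pairs.append((first, second))
    return pairs

rows = [(0.1, "blue"), (0.2, "yellow"), (0.1, "red"), (0.6, "green")]
pairs = sample_pairs(rows, 5, seed=42)
for i, (a, b) in enumerate(pairs, 1):
    print(f"{i}. {a}, {b}")
```

Each pass of the loop is independent of the others, so the same pair may show up more than once across the X results, but the two words inside one pair always differ.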