
Let's say I have a Spark DataFrame with the following schema:

root
 |-- prob: Double
 |-- word: String

I'd like to randomly select two different words from this DataFrame, but I'd like to perform this action X times, so at the end I'll have X tuples of words selected at random, where every selection is independent of the others. How do I accomplish this?

EXAMPLE:

Let's say this is my dataset:

[(0.1,"blue"),(0.2,"yellow"),(0.1,"red"),(0.6,"green")]

where the first number is prob and the second is the word. For X=5 the output could be:

1. blue, green
2. green, yellow
3. green, yellow
4. yellow, blue
5. green, red

As the selections are independent, you can see that 2 and 3 are the same, and that's fine. But within each tuple, a word can appear only once.

shakedzy

1 Answer


1) You can use one of these DataFrame methods:

  • randomSplit(weights: Array[Double], seed: Long)
  • randomSplitAsList(weights: Array[Double], seed: Long) or
  • sample(withReplacement: Boolean, fraction: Double)

and then take the first two rows (a minimal sketch follows).
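For example, a minimal sketch of the sample() route, where dataset stands for the (prob, word) DataFrame from the question and the fraction is an arbitrary choice:

import org.apache.spark.sql.Row

// Sample without replacement so no row can be drawn twice,
// then keep the first two sampled rows.
val pair: Array[Row] = dataset
  .sample(withReplacement = false, fraction = 0.5, seed = 42L)
  .take(2)
val words = pair.map(_.getAs[String]("word")) // e.g. Array("blue", "green")

On a tiny dataset a Bernoulli sample can return fewer than two rows, so check the length and retry if needed.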

2) Shuffle the rows and take the first two:

import org.apache.spark.sql.functions.rand
dataset.orderBy(rand()).limit(n)
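To get the X independent pairs the question asks for, one straightforward (if sequential) sketch is to repeat this shuffle with a different seed per draw; dataset and X are assumed names here:

import org.apache.spark.sql.functions.rand

// Repeat the shuffle once per draw; a fresh seed per iteration keeps the
// draws independent and reproducible.
val X = 5
val pairs: Seq[(String, String)] = (1 to X).map { i =>
  val rows = dataset.orderBy(rand(i)).limit(2).collect() // two distinct rows
  (rows(0).getAs[String]("word"), rows(1).getAs[String]("word"))
}

Each iteration triggers its own Spark job, so the draws run one after another from the driver; see the comments below on parallelizing them.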

3) Or you can use the takeSample method of the underlying RDD and then convert the result back to a DataFrame:

def takeSample(
      withReplacement: Boolean,
      num: Int,
      seed: Long = Utils.random.nextLong): Array[T]

For example:

val sampled = dataframe.rdd.takeSample(false, 2) // Array[Row]; .toDF() won't compile here (no Encoder[Row])
val sampledDF = spark.createDataFrame(spark.sparkContext.parallelize(sampled), dataframe.schema)
Yehor Krivokon
  • What you mentioned is the process of getting one tuple of words - I need to repeat this process several times. If I repeat this process in a loop, will Spark parallelize it, or do it one by one? – shakedzy Jan 22 '18 at 10:03
  • Each iteration of the loop will be done in parallel internally, but I don't think that Spark will parallelize the loop as a whole. – Yehor Krivokon Jan 22 '18 at 10:14
  • Is there a way to make it parallelize the loop too? The iterations are independent, so running them sequentially is a waste. – shakedzy Jan 22 '18 at 11:07
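A hedged sketch of one way to parallelize the X draws themselves, assuming the word list is small enough to collect to the driver (like the approaches above, this samples uniformly and ignores prob; dataset, X and spark are the names used earlier):

import scala.util.Random

// Broadcast the word list and run the X draws as Spark tasks.
val words = dataset.select("word").collect().map(_.getString(0))
val bWords = spark.sparkContext.broadcast(words)

val pairs: Array[(String, String)] = spark.sparkContext
  .parallelize(1 to X)
  .map { i =>
    val rng = new Random(i) // per-draw seed => independent, reproducible draws
    val Seq(a, b) = rng.shuffle(bWords.value.toSeq).take(2) // needs at least 2 words
    (a, b) // two distinct words per tuple
  }
  .collect()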