
Let's say I have a Spark DataFrame with the following schema:

root
 |-- prob: Double
 |-- word: String

I'd like to randomly select two different words from this DataFrame, but I'd like to perform this action X times, so at the end I'll have X tuples of words selected at random, where every selection is independent of the others. How do I accomplish this?

EXAMPLE:

Let's say this is my dataset:

[(0.1,"blue"),(0.2,"yellow"),(0.1,"red"),(0.6,"green")]

where the first number is prob and the second is the word. For X=5 the output could be:

1. blue, green
2. green, yellow
3. green, yellow
4. yellow, blue
5. green, red

As the selections are independent, you can see that 2 and 3 are the same, and that's fine. But within each tuple, a word can appear only once.

shakedzy

1 Answer


1) You can use one of these DataFrame methods:

  • randomSplit(weights: Array[Double], seed: Long)
  • randomSplitAsList(weights: Array[Double], seed: Long) or
  • sample(withReplacement: Boolean, fraction: Double)

and then take the first two rows (a minimal sketch follows).
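For example, a minimal sketch of the sample() route, where dataset stands for the (prob, word) DataFrame from the question and the fraction is an arbitrary choice:

import org.apache.spark.sql.Row

// Sample without replacement so no row can be drawn twice,
// then keep the first two sampled rows.
val pair: Array[Row] = dataset
  .sample(withReplacement = false, fraction = 0.5, seed = 42L)
  .take(2)
val words = pair.map(_.getAs[String]("word")) // e.g. Array("blue", "green")

On a tiny dataset a Bernoulli sample can return fewer than two rows, so check the length and retry if needed.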

2) Shuffle the rows and take the first two:

import org.apache.spark.sql.functions.rand
dataset.orderBy(rand()).limit(n)
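To get the X independent pairs the question asks for, one straightforward (if sequential) sketch is to repeat this shuffle with a different seed per draw; dataset and X are assumed names here:

import org.apache.spark.sql.functions.rand

// Repeat the shuffle once per draw; a fresh seed per iteration keeps the
// draws independent and reproducible.
val X = 5
val pairs: Seq[(String, String)] = (1 to X).map { i =>
  val rows = dataset.orderBy(rand(i)).limit(2).collect() // two distinct rows
  (rows(0).getAs[String]("word"), rows(1).getAs[String]("word"))
}

Each iteration triggers its own Spark job, so the draws run one after another from the driver; see the comments below on parallelizing them.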

3) Or you can use the takeSample method of the underlying RDD and then convert the result back to a DataFrame:

def takeSample(
      withReplacement: Boolean,
      num: Int,
      seed: Long = Utils.random.nextLong): Array[T]

For example:

val sampled = dataframe.rdd.takeSample(false, 2) // Array[Row]; .toDF() won't compile here (no Encoder[Row])
val sampledDF = spark.createDataFrame(spark.sparkContext.parallelize(sampled), dataframe.schema)
Yehor Krivokon
  • What you mentioned is the process of getting one tuple of words - I need to repeat this process several times. If I repeat this process in a loop, will Spark parallelize it, or do it one by one? – shakedzy Jan 22 '18 at 10:03
  • Each iteration of the loop will be done in parallel internally, but I don't think that Spark will parallelize the loop as a whole. – Yehor Krivokon Jan 22 '18 at 10:14
  • Is there a way to make it parallelize the loop too? The iterations are independent, so running them sequentially is a waste. – shakedzy Jan 22 '18 at 11:07
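A hedged sketch of one way to parallelize the X draws themselves, assuming the word list is small enough to collect to the driver (like the approaches above, this samples uniformly and ignores prob; dataset, X and spark are the names used earlier):

import scala.util.Random

// Broadcast the word list and run the X draws as Spark tasks.
val words = dataset.select("word").collect().map(_.getString(0))
val bWords = spark.sparkContext.broadcast(words)

val pairs: Array[(String, String)] = spark.sparkContext
  .parallelize(1 to X)
  .map { i =>
    val rng = new Random(i) // per-draw seed => independent, reproducible draws
    val Seq(a, b) = rng.shuffle(bWords.value.toSeq).take(2) // needs at least 2 words
    (a, b) // two distinct words per tuple
  }
  .collect()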