
I am attempting the following in Scala-Spark.

I'm hoping someone can give me some guidance on how to tackle this problem, or point me to some resources so I can figure out what to do.

I have a dateCountDF with a count corresponding to each date. For each dateCountDF row (one per month-end), I would like to randomly select that many entries from another DataFrame, entitiesDF, where entitiesDF.FirstDate < dateCountDF.Date && dateCountDF.Date <= entitiesDF.LastDate, and then place all the results into a new DataFrame. See below for a data example.

I'm not at all sure how to approach this problem from a Spark SQL or Spark-MapReduce perspective. The furthest I got was a naive approach, where I use a foreach on one DataFrame and then refer to the other DataFrame within the function. But this doesn't work because of the distributed nature of Spark.

val randomEntities = dateCountDF.foreach(x => {
  val count: Int = x(1).toString.toInt
  // Fails: entitiesDF can't be referenced inside a closure that runs on
  // executors, and foreach returns Unit, so nothing is actually collected.
  val result = entitiesDF.take(count)
  result
})

DataFrames

**dateCountDF**
+----------+----------------+
|   Date   |      Count     |
+----------+----------------+
|2016-08-31|               4|
|2015-12-31|               1|
|2016-09-30|               5|
|2016-04-30|               5|
|2015-11-30|               3|
|2016-05-31|               7|
|2016-11-30|               2|
|2016-07-31|               5|
|2016-12-31|               9|
|2014-06-30|               4|
+----------+----------------+
only showing top 10 rows

**entitiesDF**
+----------+-----------------+----------+
|    ID    |     FirstDate   | LastDate |
+----------+-----------------+----------+
|       296|       2014-09-01|2015-07-31|
|       125|       2015-10-01|2016-12-31|
|       124|       2014-08-01|2015-03-31|
|       447|       2017-02-01|2017-01-01|
|       307|       2015-01-01|2015-04-30|
|       574|       2016-01-01|2017-01-31|
|       613|       2016-04-01|2017-02-01|
|       169|       2009-08-23|2016-11-30|
|       205|       2017-02-01|2017-02-01|
|       433|       2015-03-01|2015-10-31|
+----------+-----------------+----------+
only showing top 10 rows

Edit: For clarification: my inputs are entitiesDF and dateCountDF. I want to loop through dateCountDF, and for each row select a random Count-many entities from entitiesDF where entitiesDF.FirstDate < dateCountDF.Date && dateCountDF.Date <= entitiesDF.LastDate.
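The per-row logic described above can be sketched with plain Python lists standing in for the two DataFrames (the sample data and helper name below are illustrative only, not taken from the real dataset): for each (Date, Count) row, keep the entities whose FirstDate/LastDate window contains the date, then draw Count of them at random.

```python
import random
from datetime import date

# Hypothetical in-memory stand-ins for dateCountDF and entitiesDF.
date_counts = [
    (date(2016, 8, 31), 2),
    (date(2015, 12, 31), 1),
]
entities = [
    # (ID, FirstDate, LastDate)
    (125, date(2015, 10, 1), date(2016, 12, 31)),
    (574, date(2016, 1, 1), date(2017, 1, 31)),
    (169, date(2009, 8, 23), date(2016, 11, 30)),
]

rng = random.Random(42)

def sample_entities(date_counts, entities, rng):
    """For each (d, count): keep entities with FirstDate < d <= LastDate,
    then randomly sample up to `count` of them."""
    out = []
    for d, count in date_counts:
        eligible = [e for e in entities if e[1] < d <= e[2]]
        out.extend((d, e) for e in rng.sample(eligible, min(count, len(eligible))))
    return out

sampled = sample_entities(date_counts, entities, rng)
```

In Spark itself, the same shape would typically be a range join of entitiesDF to dateCountDF on the date condition, followed by per-date sampling (for example, a row_number over a random ordering, filtered against Count), rather than a driver-side loop.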

Jonathan Hall
J Schmidt

1 Answer


To select random rows you can do it like this (note that this snippet is Python/PySpark rather than Scala):

import random

def sampler(df, col, records):
    # Calculate number of rows
    colmax = df.count()

    # Create a random sample of ids from 1..colmax inclusive
    # (this assumes `col` holds consecutive ids 1..colmax)
    vals = random.sample(range(1, colmax + 1), records)

    # Use 'vals' to filter the DataFrame using 'isin'
    return df.filter(df[col].isin(vals))

Select the random number of rows you want and store them in a DataFrame, then append that data to the other DataFrame; for this you can use union (unionAll is deprecated since Spark 2.0).
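The same idea can be sketched without Spark on a plain list of rows (again assuming, as the snippet above does, that the id column holds consecutive values 1..n; the function and data here are illustrative):

```python
import random

def sampler(rows, records, rng=random.Random(0)):
    """rows: list of (id, ...) tuples with ids 1..len(rows).
    Returns `records` randomly chosen rows."""
    colmax = len(rows)
    # Draw `records` distinct ids from 1..colmax (inclusive).
    vals = set(rng.sample(range(1, colmax + 1), records))
    # Keep only the rows whose id was drawn.
    return [r for r in rows if r[0] in vals]

picked = sampler([(i, "row%d" % i) for i in range(1, 11)], 3)
```

The per-date results can then be concatenated; in Spark that would be `df1.union(df2)`.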

You can also refer to this answer.

learner