
I am attempting the following in Scala-Spark.

I'm hoping someone can give me some guidance on how to tackle this problem, or point me to some resources so I can figure out what to do.

I have a dateCountDF with a count corresponding to each date. For each dateCountDF row (one per month-end), I would like to randomly select that many entries from another DataFrame, entitiesDF, where entitiesDF.FirstDate < dateCountDF.Date && dateCountDF.Date <= entitiesDF.LastDate, and then place all the results into a new DataFrame. See below for a data example.

I'm not at all sure how to approach this problem from a Spark SQL or Spark-MapReduce perspective. The furthest I got was a naive approach, where I use a foreach on one DataFrame and then refer to the other DataFrame within the function. But this doesn't work because of the distributed nature of Spark.

val randomEntities = dateCountDF.foreach(x => {
  val count: Int = x(1).toString.toInt
  // Fails: entitiesDF can't be referenced inside a closure that runs on
  // executors, and foreach returns Unit, so nothing is actually collected.
  val result = entitiesDF.take(count)
  result
})

DataFrames

**dateCountDF**
+----------+----------------+
|   Date   |      Count     |
+----------+----------------+
|2016-08-31|               4|
|2015-12-31|               1|
|2016-09-30|               5|
|2016-04-30|               5|
|2015-11-30|               3|
|2016-05-31|               7|
|2016-11-30|               2|
|2016-07-31|               5|
|2016-12-31|               9|
|2014-06-30|               4|
+----------+----------------+
only showing top 10 rows

**entitiesDF**
+----------+-----------------+----------+
|    ID    |     FirstDate   | LastDate |
+----------+-----------------+----------+
|       296|       2014-09-01|2015-07-31|
|       125|       2015-10-01|2016-12-31|
|       124|       2014-08-01|2015-03-31|
|       447|       2017-02-01|2017-01-01|
|       307|       2015-01-01|2015-04-30|
|       574|       2016-01-01|2017-01-31|
|       613|       2016-04-01|2017-02-01|
|       169|       2009-08-23|2016-11-30|
|       205|       2017-02-01|2017-02-01|
|       433|       2015-03-01|2015-10-31|
+----------+-----------------+----------+
only showing top 10 rows

Edit: For clarification: my inputs are entitiesDF and dateCountDF. I want to loop through dateCountDF, and for each row select a random Count-many entities from entitiesDF where entitiesDF.FirstDate < dateCountDF.Date && dateCountDF.Date <= entitiesDF.LastDate.
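The per-row logic described above can be sketched with plain Python lists standing in for the two DataFrames (the sample data and helper name below are illustrative only, not taken from the real dataset): for each (Date, Count) row, keep the entities whose FirstDate/LastDate window contains the date, then draw Count of them at random.

```python
import random
from datetime import date

# Hypothetical in-memory stand-ins for dateCountDF and entitiesDF.
date_counts = [
    (date(2016, 8, 31), 2),
    (date(2015, 12, 31), 1),
]
entities = [
    # (ID, FirstDate, LastDate)
    (125, date(2015, 10, 1), date(2016, 12, 31)),
    (574, date(2016, 1, 1), date(2017, 1, 31)),
    (169, date(2009, 8, 23), date(2016, 11, 30)),
]

rng = random.Random(42)

def sample_entities(date_counts, entities, rng):
    """For each (d, count): keep entities with FirstDate < d <= LastDate,
    then randomly sample up to `count` of them."""
    out = []
    for d, count in date_counts:
        eligible = [e for e in entities if e[1] < d <= e[2]]
        out.extend((d, e) for e in rng.sample(eligible, min(count, len(eligible))))
    return out

sampled = sample_entities(date_counts, entities, rng)
```

In Spark itself, the same shape would typically be a range join of entitiesDF to dateCountDF on the date condition, followed by per-date sampling (for example, a row_number over a random ordering, filtered against Count), rather than a driver-side loop.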

Jonathan Hall
J Schmidt

1 Answer


To select random rows you can do it like this (note that this snippet is Python/PySpark rather than Scala):

import random

def sampler(df, col, records):
    # Calculate number of rows
    colmax = df.count()

    # Create a random sample of ids from 1..colmax inclusive
    # (this assumes `col` holds consecutive ids 1..colmax)
    vals = random.sample(range(1, colmax + 1), records)

    # Use 'vals' to filter the DataFrame using 'isin'
    return df.filter(df[col].isin(vals))

Select the random number of rows you want and store them in a DataFrame, then append that data to the other DataFrame; for this you can use union (unionAll is deprecated since Spark 2.0).
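The same idea can be sketched without Spark on a plain list of rows (again assuming, as the snippet above does, that the id column holds consecutive values 1..n; the function and data here are illustrative):

```python
import random

def sampler(rows, records, rng=random.Random(0)):
    """rows: list of (id, ...) tuples with ids 1..len(rows).
    Returns `records` randomly chosen rows."""
    colmax = len(rows)
    # Draw `records` distinct ids from 1..colmax (inclusive).
    vals = set(rng.sample(range(1, colmax + 1), records))
    # Keep only the rows whose id was drawn.
    return [r for r in rows if r[0] in vals]

picked = sampler([(i, "row%d" % i) for i in range(1, 11)], 3)
```

The per-date results can then be concatenated; in Spark that would be `df1.union(df2)`.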

You can also refer to this answer.

learner