Let's assume a case where we have to repartition the dataset after a filter, or to attain a desired degree of parallelism.

How can we perform dynamic repartitioning instead of manually tuning the number of partitions?

Note: looking for a solution for RDD, DataFrame and Dataset.

1 Answer

You can use repartition(colName) on the Dataset, or partitionBy() on the writer, to partition your data dynamically by column values instead of hand-tuning a partition count.
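To make the distinction concrete, here is a minimal sketch (the SparkSession setup and the output table name are illustrative assumptions, not part of the original answer): repartition(col) shuffles rows in memory so that all rows with the same key land in the same partition, while partitionBy() on the DataFrameWriter controls the directory layout when the data is written out.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("dynamic-repartition").getOrCreate()
    val df = spark.table("sensor_data")  // the table defined below

    // In-memory shuffle: rows are co-located by region_id across partitions.
    val byRegion = df.repartition(col("region_id"))

    // On-disk layout: the writer creates one sub-directory per distinct region_id.
    df.write.partitionBy("region_id").mode("overwrite").saveAsTable("sensor_data_part")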

For example, if your dataset is defined as follows:

 create table sensor_data (
   sensor_id bigint,
   temp float,
   region_id string,
   state string,
   country string
 ) partitioned by (day string)

If you want to do some region-wise calculation for a particular day:

 import org.apache.spark.sql.functions.col
 import spark.implicits._  // provides the Encoder for Event needed by mapPartitions

 val sensor_data = spark.sql("select * from sensor_data where day='2018-02-10'")
 val results = sensor_data.
   repartition(col("region_id")).
   mapPartitions(eventIter => {
     processEvent(eventIter).iterator
   })

 case class Event(sensor_id: Long, country: String, max_temp: Float)


 import scala.collection.mutable.ListBuffer
 import org.apache.spark.sql.Row

 def processEvent(evtIter: Iterator[Row]): List[Event] = {
   val maxTempEvents = ListBuffer[Event]()
   while (evtIter.hasNext) {
     val evt = evtIter.next()
     // do your calculation and add the results to the maxTempEvents list
   }
   maxTempEvents.toList
 }
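Since the question also asks about plain RDDs: there is no column-based repartition at the RDD level, but you can get the same key-based co-location by keying the RDD and using partitionBy with a Partitioner. A minimal sketch under that assumption, reusing the sensor_data DataFrame from above (sizing the partitioner from the current partition count rather than a hand-tuned number):

    import org.apache.spark.HashPartitioner

    // Key each row by region_id, then co-locate rows that share a key.
    val byRegionRdd = sensor_data.rdd
      .map(row => (row.getAs[String]("region_id"), row))
      .partitionBy(new HashPartitioner(sensor_data.rdd.getNumPartitions))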

Hope this helps.

Thanks, Ravi

  • It would be great if you could give an example, please. – prady Feb 14 '18 at 03:27
  • Thanks very much. It would be a great help if you could also answer this question: [Link](https://stackoverflow.com/questions/48780625/how-do-we-calculate-the-input-data-size-and-feed-the-number-of-partitions-to-re) – prady Feb 14 '18 at 06:04