
I am an iOS developer now switching to Spark. How can I find the max and min in a Spark RDD with a single aggregate function (Spark SQL preferred)? For example, suppose I have a salary column in a CSV file: I want to load it into a Spark RDD and find the max and min using a single function. Also, how can I load a CSV file into an RDD (Scala preferred)? I do not want to convert it into a DataFrame, and I do not want to call max and min twice; I want to find both with one aggregate function, on the RDD itself.

John smith
  • Can you explain your use case a bit? Also, some example code of what you are trying to accomplish would be helpful. – Tawkir Jun 01 '17 at 11:36
  • Suppose I have a salary column in my CSV file; I want to load it into a Spark RDD and find the max and min using a single function. – John smith Jun 01 '17 at 12:00

2 Answers


You can use the aggregate function to perform a custom aggregation in a single pass over the RDD.

The aggregated value should be a custom object that stores both the min and the max:

case class MinMax[T](min: T, max: T)

aggregate requires two functions: one to fold a new value into an aggregation (the sequence function) and one to combine aggregated results (the combine function):

// Combine two partial results (used to merge per-partition aggregates)
def comb[T](left: MinMax[T], right: MinMax[T])(implicit ordering: Ordering[T]): MinMax[T] = {
  MinMax(min = ordering.min(left.min, right.min), max = ordering.max(left.max, right.max))
}

// Fold a single value into the running aggregate within a partition
def seq[T](minMax: MinMax[T], value: T)(implicit ordering: Ordering[T]): MinMax[T] = {
  comb(minMax, MinMax(value, value))
}

Having those, you can aggregate; for example, over an RDD of Long values:

// The zero value MinMax(Long.MaxValue, Long.MinValue) is the identity for min/max
val minMax = rdd.aggregate(MinMax(Long.MaxValue, Long.MinValue))((mm, t) => seq(mm, t), (l, r) => comb(l, r))
val min = minMax.min
val max = minMax.max
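
Since the question also asks how to load the CSV into an RDD without a DataFrame, here is a minimal end-to-end sketch using sc.textFile. The file name emp.csv comes from the comments below; the presence of a header row and the salary column being at index 2 are assumptions for illustration:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("SalaryMinMax").setMaster("local[*]"))

// Assumed layout: emp.csv with a header row and the salary in the 3rd column (index 2)
val lines = sc.textFile("emp.csv")
val header = lines.first()
val salaries = lines
  .filter(_ != header)              // drop the header row
  .map(_.split(",")(2).trim.toLong) // parse the assumed salary column

// One pass over the RDD computes both values
val salaryMinMax = salaries.aggregate(MinMax(Long.MaxValue, Long.MinValue))((mm, v) => seq(mm, v), (l, r) => comb(l, r))
println(s"min = ${salaryMinMax.min}, max = ${salaryMinMax.max}")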
Nazarii Bardiuk
  • Is there any way to perform it in Spark SQL? – John smith Jun 01 '17 at 12:14
  • Yes, take a look at this answer https://stackoverflow.com/a/36051300/187261 – Nazarii Bardiuk Jun 01 '17 at 12:17
  • Thank you so much for the help. I do have 2 questions; would you mind if I mail you? Both questions are not hard, but since I am new to Spark I am facing issues. – John smith Jun 01 '17 at 12:23
  • Yes, although you can create new questions here on Stack Overflow and somebody will answer them more quickly. – Nazarii Bardiuk Jun 01 '17 at 13:28
  • a) Write a Spark job to read the sales data (emp.csv) and find the minimum and maximum salary values using a single aggregate() function with a Spark RDD. b) Write a Spark job to read the sales data (emp.csv) and find the employee-wise minimum and maximum salary values using a single foldByKey() function with a Spark RDD. (Spark SQL) – John smith Jun 01 '17 at 14:45
  • Create new separate questions; this discussion is not for comments. – Nazarii Bardiuk Jun 01 '17 at 15:01
  • Apologies, I am new to Stack Overflow. I have created a new question: https://stackoverflow.com/questions/44311081/how-to-write-these-jobs. – John smith Jun 01 '17 at 15:05

One way to find the max and min in Spark with Scala is to convert your RDD to a DataFrame and compute min and max in a single aggregation; see the sketch below.
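
A minimal sketch of that approach, assuming a SparkSession is available and the CSV has a header row with a column named salary (both assumptions here):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min}

val spark = SparkSession.builder().appName("MinMaxDF").master("local[*]").getOrCreate()

// Assumed: emp.csv has a header row containing a "salary" column
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("emp.csv")

// Both min and max are computed in a single aggregation
df.agg(min("salary"), max("salary")).show()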

Ramesh Maharjan