I am an iOS developer now switching to Spark. I want to know how I can find the max and min in a Spark RDD with a single aggregate function (Spark SQL preferred). For example: suppose I have a salary column in my CSV file; I want to load it into a Spark RDD and find the max and min using a single function. Also, how can I load a CSV file into an RDD (Scala preferred)? I do not want to convert it to a DataFrame. I want to find the max and min with a single aggregate function, on the RDD itself, without calling max and min twice.
- Can you explain your use case a bit? Also, some example code of what you are trying to accomplish would be helpful. – Tawkir Jun 01 '17 at 11:36
- Suppose I have a salary column in my CSV file; I want to load it into a Spark RDD and find the max and min using a single function. – John smith Jun 01 '17 at 12:00
2 Answers
You can use the aggregate function to perform a custom aggregation. The aggregated value should be a custom object that stores both the min and the max:
case class MinMax[T](min: T, max: T)
aggregate requires two functions: one to merge two aggregated results (combOp) and one to add a new value to an aggregation (seqOp):
// combOp: merges two partial aggregation results
def comb[T](left: MinMax[T], right: MinMax[T])(implicit ordering: Ordering[T]): MinMax[T] = {
  MinMax(min = ordering.min(left.min, right.min), max = ordering.max(left.max, right.max))
}

// seqOp: folds a single value into the running aggregate
def seq[T](minMax: MinMax[T], value: T)(implicit ordering: Ordering[T]): MinMax[T] = {
  comb(minMax, MinMax(value, value))
}
With those in place you can aggregate; for example, for an RDD[Long]:
val minMax = rdd.aggregate(MinMax(Long.MaxValue, Long.MinValue))((mm, t) => seq(mm, t), (l, r) => comb(l, r))
val min = minMax.min
val max = minMax.max
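Applied to the question's CSV, a minimal sketch, assuming spark-shell (so sc is in scope), a hypothetical emp.csv with a header row, the salary as an integral value in the third column, and reusing MinMax, seq, and comb from above:

// Load the CSV as an RDD of lines and extract the salary column.
val lines = sc.textFile("emp.csv")
val header = lines.first()
val salaries = lines
  .filter(_ != header)              // drop the header row
  .map(_.split(",")(2).trim.toLong) // salary assumed in the 3rd column

// One pass over the RDD yields both min and max.
val minMaxSalary = salaries.aggregate(MinMax(Long.MaxValue, Long.MinValue))(
  (mm, v) => seq(mm, v),
  (l, r) => comb(l, r))
println(s"min = ${minMaxSalary.min}, max = ${minMaxSalary.max}")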

Nazarii Bardiuk
- Yes, take a look at this answer: https://stackoverflow.com/a/36051300/187261 – Nazarii Bardiuk Jun 01 '17 at 12:17
- Thank you so much for the help. I do have two questions; would you mind if I mail you? Both questions are not hard, but since I am new to Spark I am facing issues. – John smith Jun 01 '17 at 12:23
- Yes, although you can create new questions here on Stack Overflow and somebody will answer them quicker. – Nazarii Bardiuk Jun 01 '17 at 13:28
- a) Write a Spark job to read the sales data (emp.csv) and find the minimum and maximum salary value using a single aggregate() function with a Spark RDD. b) Write a Spark job to read the sales data (emp.csv) and find the employee-wise minimum and maximum salary value using a single foldByKey() function with a Spark RDD. (Spark SQL) (See the foldByKey sketch after this thread.) – John smith Jun 01 '17 at 14:45
- Create new separate questions; this discussion is not for comments. – Nazarii Bardiuk Jun 01 '17 at 15:01
- Apologies, I am new to Stack Overflow. I have created a new question: https://stackoverflow.com/questions/44311081/how-to-write-these-jobs. – John smith Jun 01 '17 at 15:05
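For readers with the same exercise, a hedged sketch of the foldByKey variant (part b), reusing MinMax and comb from the answer above; the file name and the (name, salary) column positions are assumptions:

// Hypothetical pair RDD of (employee name, salary), parsed from emp.csv.
val lines = sc.textFile("emp.csv")
val header = lines.first()
val pairs = lines
  .filter(_ != header)
  .map { line =>
    val cols = line.split(",")
    (cols(0), cols(2).trim.toLong) // name assumed in column 1, salary in column 3
  }

// foldByKey merges values per key; wrap each salary as a MinMax first
// so a single pass yields both extremes per employee.
val perEmployee = pairs
  .mapValues(s => MinMax(s, s))
  .foldByKey(MinMax(Long.MaxValue, Long.MinValue))((l, r) => comb(l, r))

perEmployee.collect().foreach { case (name, mm) =>
  println(s"$name: min = ${mm.min}, max = ${mm.max}")
}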
One way to find Max and Min in Spark Scala is to convert your RDD to a DataFrame and find Min and Max in an aggregation (more info).
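For completeness, a minimal sketch of that DataFrame route, assuming Spark 2.x with a SparkSession named spark and the same hypothetical emp.csv with a header row and a salary column:

import org.apache.spark.sql.functions.{max, min}

// Read the CSV directly into a DataFrame (header and schema inference assumed).
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("emp.csv")

// A single agg call computes both values in one job.
df.agg(min("salary"), max("salary")).show()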

Ramesh Maharjan