I would like to use the groupBy operator on a DataFrame with my own equality comparators.
Suppose I want to execute something like:
df.groupBy("Year","Month").sum("Counter")
In this DataFrame:
Year | Month | Counter
---------------------------
2012 | Jan | 100
12 | January | 200
12 | Janu | 300
2012 | Feb | 400
13 | Febr | 500
I have to implement two comparators:
1) For column Year: e.g. "2012" == "12"
2) For column Month: e.g. "Jan" == "January" == "Janu"
Let's assume that I have already implemented these two comparators. How can I invoke them? As in this example, I already know that I have to convert my DataFrame into an RDD to be able to use my comparators.
I thought about using the RDD groupBy.
Note that I really need to do this using comparators. I can't use UDFs, change the data or create new columns. The eventual goal is to have ciphertext columns, for which I have functions that tell me whether two ciphertexts are equal. I want to use those functions in my comparators.
Edit:
At the moment, I am trying to do this with only one column, like:
df.groupBy("Year").sum("Counter")
I have a Wrapper class:
class ExampleWrapperYear (val year: Any) extends Serializable {
  // override hashCode and equals here
}
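To make the wrapper idea concrete, here is a minimal sketch of what such a class might look like. The normalization rule (comparing only the last two digits of the year) is an assumption for illustration only, not the actual comparator. The point it shows is that hashCode has to agree with equals, since hash-based grouping only merges keys that are equal and also hash to the same value.

```scala
// Hypothetical wrapper: treats "2012" and "12" as the same year by
// comparing only the last two digits (an assumption for illustration).
class ExampleWrapperYear(val year: Any) extends Serializable {
  // Canonical form shared by equals and hashCode.
  private def lastTwoDigits: String = year.toString.trim.takeRight(2)

  override def equals(other: Any): Boolean = other match {
    case that: ExampleWrapperYear => lastTwoDigits == that.lastTwoDigits
    case _                        => false
  }

  // Must return the same value for any two wrappers that compare equal.
  override def hashCode: Int = lastTwoDigits.hashCode

  override def toString: String = s"Year($year)"
}
```

If the comparator can only answer "equal or not" (as might be the case for ciphertexts) and no canonical form is available, a constant hashCode would still satisfy the contract, at the cost of sending every key to the same hash bucket or partition.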
Then, I am doing this:
val rdd = df.rdd.keyBy(a => new ExampleWrapperYear(a(0))).groupByKey()
My question is how to do the "sum", and how to use keyBy with multiple columns so that both ExampleWrapperYear and ExampleWrapperMonth are used.
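One possible direction, sketched on a plain Scala collection so that it runs without a cluster: key each row by a tuple of the two wrappers and sum the counters per key. The bodies of both wrappers and their normalization rules (last two digits of the year, first three letters of the month) are assumptions for illustration; the real comparators would replace them.

```scala
// Hypothetical wrappers; the normalization rules are assumptions.
class ExampleWrapperYear(val year: Any) extends Serializable {
  private def canon: String = year.toString.takeRight(2) // "2012" -> "12"
  override def equals(o: Any): Boolean = o match {
    case that: ExampleWrapperYear => canon == that.canon
    case _                        => false
  }
  override def hashCode: Int = canon.hashCode
}

class ExampleWrapperMonth(val month: Any) extends Serializable {
  private def canon: String = month.toString.take(3).toLowerCase // "Janu" -> "jan"
  override def equals(o: Any): Boolean = o match {
    case that: ExampleWrapperMonth => canon == that.canon
    case _                         => false
  }
  override def hashCode: Int = canon.hashCode
}

// Sample rows from the question: (Year, Month, Counter).
val rows = Seq(
  ("2012", "Jan", 100), ("12", "January", 200), ("12", "Janu", 300),
  ("2012", "Feb", 400), ("13", "Febr", 500)
)

// A tuple delegates equals/hashCode to its elements, so a pair of
// wrappers works as a composite grouping key.
val sums: Map[(ExampleWrapperYear, ExampleWrapperMonth), Int] =
  rows
    .groupBy { case (y, m, _) => (new ExampleWrapperYear(y), new ExampleWrapperMonth(m)) }
    .map { case (key, group) => key -> group.map(_._3).sum }
```

On an RDD the same tuple key should carry over: map each Row to a (key, counter) pair and call reduceByKey(_ + _), for example df.rdd.map(r => ((new ExampleWrapperYear(r(0)), new ExampleWrapperMonth(r(1))), r.getInt(2))).reduceByKey(_ + _). That line assumes Counter is stored as an integer column. Compared with groupByKey, reduceByKey combines values before the shuffle, so the full group never has to be materialized just to add it up.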