I am a newbie to Spark and I am trying to perform a group by and count using the following Spark functions:
Dataset<Row> result = dataset
    .groupBy("column1", "column2")
    .count();
But I read here that using group by is not a good idea since it does not have a combiner, which in turn hurts the Spark job's runtime efficiency, and that one should instead use the reduceByKey function for aggregation operations.
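If I understand that advice correctly, it refers to the classic RDD pattern below (a minimal sketch of my understanding, written with Java 8 lambdas; the JavaSparkContext sc and the input path are just placeholders):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// map each element to (element, 1) and sum per key; reduceByKey can combine
// partial counts on each partition before the shuffle, unlike groupByKey
JavaRDD<String> lines = sc.textFile("input.txt");
JavaPairRDD<String, Integer> counts = lines
    .mapToPair(s -> new Tuple2<>(s, 1))
    .reduceByKey((a, b) -> a + b);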
So I tried using the reduceByKey function, but it is not available for a Dataset. Instead, Datasets provide reduce(ReduceFunction<Row> func).
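As far as I can tell, that reduce collapses the whole Dataset to a single Row rather than producing one row per key, along these lines (just a sketch; it assumes column 0 is numeric):

import org.apache.spark.api.java.function.ReduceFunction;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

// reduce() merges two rows at a time until a single Row is left for the whole
// Dataset, so there is no per-key grouping involved
Row total = dataset.reduce(new ReduceFunction<Row>() {
    @Override
    public Row call(Row r1, Row r2) throws Exception {
        return RowFactory.create(r1.getLong(0) + r2.getLong(0));
    }
});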
Since I cannot find an example of performing a group and count with the reduce function, I tried converting the Dataset to a JavaRDD and using reduceByKey:
// map each row to 1 and then group them by key
JavaPairRDD<String[], Integer> mapOnes;
try {
    mapOnes = dailySummary.javaRDD().mapToPair(
        new PairFunction<Row, String[], Integer>() {
            @Override
            public Tuple2<String[], Integer> call(Row t) throws Exception {
                return new Tuple2<String[], Integer>(
                    new String[]{t.getAs("column1"), t.getAs("column2")}, 1);
            }
        });
} catch (Exception e) {
    log.error("exception in mapping ones: " + e);
    throw new Exception();
}
JavaPairRDD<String[], Integer> rowCount;
try {
    rowCount = mapOnes.reduceByKey(
        new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });
} catch (Exception e) {
    log.error("exception in reduce by key: " + e);
    throw new Exception();
}
But this also throws an exception, org.apache.spark.SparkException: Task not serializable, from the mapToPair function.
Can anyone suggest a better way to group and count using the Dataset's map and reduce functions?
Any help is appreciated.