I have the following data and need to group it by key and count the non-null values per key to monitor some metrics. I can use groupBy and count within each group, but this involves a shuffle. Can this be done without a shuffle?
ID,TempID,PermanantID
----------
xxx, abcd, 12345
xxx, efg, 1345
xxx, ijk, 1534
xxx, lmn, 13455
xxx, null, 12345
xxx, axg, null
yyy, abcd, 12345
yyy, efg, 1345
yyy, ijk, 1534
zzz, lmn, 13455
zzz, abc, null
The output should be:
ID Count1 Count2
----------
XXX 5 5
YYY 3 3
ZZZ 2 1
I can do this with groupBy and count:
dataframe.groupBy("ID").agg(count(col("TempID")).as("Count1"), count(col("PermanantID")).as("Count2"))
Can we do this using mapPartitions instead?
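For reference, here is a minimal sketch of what per-partition counting would amount to. It is modelled with plain Scala collections rather than a real Spark job, so it runs without a cluster. Everything named here is an assumption for illustration: the `Row` case class, the `countPartition` and `merge` helpers, and the way the sample rows are split into two partitions. It suggests that counting inside each partition needs no shuffle, but the partial counts for an ID that spans partitions still have to be combined afterwards.

```scala
// Hypothetical model of one record; None stands in for a null column value.
final case class Row(id: String, tempId: Option[String], permanantId: Option[String])

object PartialCounts {
  // Count non-null TempID / PermanantID per ID inside ONE partition.
  // This is the kind of function one could pass to mapPartitions.
  def countPartition(rows: Iterator[Row]): Iterator[(String, (Long, Long))] = {
    val acc = scala.collection.mutable.Map.empty[String, (Long, Long)]
    rows.foreach { r =>
      val (c1, c2) = acc.getOrElse(r.id, (0L, 0L))
      val n1 = c1 + (if (r.tempId.isDefined) 1L else 0L)
      val n2 = c2 + (if (r.permanantId.isDefined) 1L else 0L)
      acc.update(r.id, (n1, n2))
    }
    acc.iterator
  }

  // Combine the partial counts of all partitions. In Spark, this is the step
  // where partials for the same ID must meet (a shuffle, or a collect to the driver).
  def merge(partials: Seq[(String, (Long, Long))]): Map[String, (Long, Long)] =
    partials.groupBy(_._1).map { case (id, xs) =>
      id -> xs.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
    }

  // Assumed split of the sample data: "xxx" deliberately spans both partitions.
  val partitions: Seq[Seq[Row]] = Seq(
    Seq(
      Row("xxx", Some("abcd"), Some("12345")),
      Row("xxx", Some("efg"), Some("1345")),
      Row("xxx", Some("ijk"), Some("1534")),
      Row("xxx", Some("lmn"), Some("13455")),
      Row("xxx", None, Some("12345"))
    ),
    Seq(
      Row("xxx", Some("axg"), None),
      Row("yyy", Some("abcd"), Some("12345")),
      Row("yyy", Some("efg"), Some("1345")),
      Row("yyy", Some("ijk"), Some("1534")),
      Row("zzz", Some("lmn"), Some("13455")),
      Row("zzz", Some("abc"), None)
    )
  )

  // Per-partition counts first, then one merge over the small partial results.
  val result: Map[String, (Long, Long)] =
    merge(partitions.flatMap(p => countPartition(p.iterator)))
}
```

Note that `groupBy("ID").agg(count(...))` already aggregates partially within each partition before the exchange, so what gets shuffled is roughly one small row per ID per partition rather than the raw data. A hand-written mapPartitions version would likely save little unless the data is already partitioned by ID.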