Suppose I have the following RDD in PySpark, where each row is a list:
[foo, apple]
[foo, orange]
[foo, apple]
[foo, apple]
[foo, grape]
[foo, grape]
[foo, plum]
[bar, orange]
[bar, orange]
[bar, orange]
[bar, grape]
[bar, apple]
[bar, apple]
[bar, plum]
[scrog, apple]
[scrog, apple]
[scrog, orange]
[scrog, orange]
[scrog, grape]
[scrog, plum]
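For reference, the RDD can be built like this (a minimal sketch, assuming an existing or freshly created SparkContext `sc`):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

data = [
    ["foo", "apple"], ["foo", "orange"], ["foo", "apple"], ["foo", "apple"],
    ["foo", "grape"], ["foo", "grape"], ["foo", "plum"],
    ["bar", "orange"], ["bar", "orange"], ["bar", "orange"], ["bar", "grape"],
    ["bar", "apple"], ["bar", "apple"], ["bar", "plum"],
    ["scrog", "apple"], ["scrog", "apple"], ["scrog", "orange"],
    ["scrog", "orange"], ["scrog", "grape"], ["scrog", "plum"],
]
rdd = sc.parallelize(data)
```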
I would like to show the top 3 fruits (index 1) for each group (index 0), ordered by fruit count. For the sake of simplicity, assume I don't care much about ties (e.g. `scrog` has count 1 for both `grape` and `plum`; I don't care which is kept).
My goal is output like:
foo, apple, 3
foo, grape, 2
foo, orange, 1
bar, orange, 3
bar, apple, 2
bar, plum, 1     # <------- NOTE: could also be "grape" with count 1
scrog, orange, 2 # <---------- NOTE: "scrog" has many ties, which is okay
scrog, apple, 2
scrog, grape, 1
I can think of a likely inefficient approach (sketched below):

- get the unique groups and `.collect()` them as a list
- filter the full RDD by group, then count and sort the fruits
- use something like `zipWithIndex()` to keep the top 3 counts
- save each result as a new RDD in the format `(<group>, <fruit>, <count>)`
- union all the RDDs at the end
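Concretely, I imagine that approach looking something like this untested sketch (variable names are mine):

```python
# Likely inefficient: one full pass over the RDD per group.
groups = rdd.map(lambda row: row[0]).distinct().collect()

per_group = []
for g in groups:
    top3 = (
        rdd.filter(lambda row, g=g: row[0] == g)
           .map(lambda row: (row[1], 1))
           .reduceByKey(lambda a, b: a + b)            # count each fruit
           .sortBy(lambda kv: kv[1], ascending=False)  # order by count, descending
           .zipWithIndex()
           .filter(lambda pair: pair[1] < 3)           # keep the top 3
           .map(lambda pair, g=g: (g, pair[0][0], pair[0][1]))
    )
    per_group.append(top3)

result = sc.union(per_group)
```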
But I'm interested not only in more Spark-specific approaches, but also in ones that might skip expensive actions like `collect()` and `zipWithIndex()`.
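To make the question concrete, the kind of single-pass shape I'm hoping for might look like the untested sketch below, counting with `reduceByKey` and then keeping the top 3 per group with `heapq.nlargest` (whether `groupByKey` here is actually cheaper is exactly what I'm unsure about):

```python
import heapq

# One pass to count (group, fruit) pairs, then top 3 per group,
# without collecting anything to the driver.
counts = (
    rdd.map(lambda row: ((row[0], row[1]), 1))
       .reduceByKey(lambda a, b: a + b)                # ((group, fruit), count)
       .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))  # (group, (fruit, count))
)

def top3(fruit_counts):
    # ties land wherever heapq leaves them, which is fine for now
    return heapq.nlargest(3, fruit_counts, key=lambda fc: fc[1])

result = (
    counts.groupByKey()
          .flatMap(lambda kv: [(kv[0], fruit, count) for fruit, count in top3(kv[1])])
)
```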
As a bonus -- but not required -- if I did want to apply sorting/filtering to address ties, where might that best be accomplished?
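For instance, I assume I could make ties deterministic by swapping the top-3 helper for one with a composite sort key (count descending, then fruit name ascending; the policy itself is arbitrary):

```python
def top3_with_tiebreak(fruit_counts):
    # count descending, then fruit name ascending, so equal counts
    # always resolve the same way
    return sorted(fruit_counts, key=lambda fc: (-fc[1], fc[0]))[:3]
```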
Any advice much appreciated!
UPDATE: in my context I am unable to use DataFrames; I must use RDDs only.