I have a list of members, each with many attributes, two of which are a name and an ID. I want to get a list of tuples in an RDD, where each tuple contains the ID as the first element and the number of unique names associated with that ID as the second element, e.g. (ID, <# of unique names associated with ID>).
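To make the target concrete, here is a plain-Python sketch (no Spark involved) of the result I am after. The `Member` tuple, the sample data, and the helper name `unique_name_counts` are all made up for illustration; my real member objects have many more attributes.

```python
from collections import namedtuple

# Hypothetical stand-in for a member; only the two relevant attributes are kept.
Member = namedtuple("Member", ["ID", "name"])

# Made-up sample data: ID 1 has two distinct names, ID 2 has one.
members_sample = [
    Member(1, "alice"),
    Member(1, "alice"),
    Member(1, "bob"),
    Member(2, "carol"),
]

def unique_name_counts(members):
    """Return sorted (ID, number of distinct names) tuples."""
    names_by_id = {}
    for m in members:
        # Collect each ID's names in a set so duplicates are dropped.
        names_by_id.setdefault(m.ID, set()).add(m.name)
    return sorted((id_, len(names)) for id_, names in names_by_id.items())

result = unique_name_counts(members_sample)
print(result)  # [(1, 2), (2, 1)]
```

The RDD version should produce the same kind of (ID, count) pairs, just distributed.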
Here's the code that I have written to accomplish this:
IDnametuple = members.map(lambda a: (a.ID, a.name))  # extract only ID and name
idnamelist = IDnametuple.groupByKey()  # group the names by ID
idnameunique_count = (idnamelist
    # set(tup[1]) should extract unique elements,
    # and len should tell the number of them
    .map(lambda tup: (tup[0], len(set(tup[1])))))
It is incredibly slow, much slower than similar operations that count unique attributes for each member.
Is there a quicker way to do this? I tried to use as many built-ins as possible, which, from what I've heard, is the right way to speed things up.