I would like to group by a value and then find the max value in each group using PySpark. I have the following code, but now I am a bit stuck on how to extract the max value.
# some file contains tuples ('user', 'item', 'occurrences')
data_file = sc.textFile('file:///some_file.txt')
# Split each line into a (user, item, occurrences) triplet
data_file = data_file.map(lambda l: l.split()).map(lambda l: (l[0], l[1], float(l[2])))
# Group by the user i.e. r[0]
grouped = data_file.groupBy(lambda r: r[0])
# Here is where I am stuck
group_list = grouped.map(lambda x: (list(x[1]))) #?
Returns something like:
[[(u'u1', u's1', 20), (u'u1', u's2', 5)], [(u'u2', u's3', 5), (u'u2', u's2', 10)]]
I want to find the max 'occurrences' for each user now. The final result after taking the max would be an RDD that looks like this:
[[(u'u1', u's1', 20)], [(u'u2', u's2', 10)]]
Where only the max triplet remains for each user in the file. In other words, I want to reduce the RDD so that it contains only a single triplet per user: the one with that user's maximum occurrences.
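For reference, here is a hedged sketch of what I think the reduction step should do. The PySpark lines in the comments are an assumption on my part (keying each triplet by user and using `reduceByKey`); the plain-Python loop below mirrors that logic on the sample data so it can be run without Spark:

```python
# Assumed PySpark equivalent (not verified):
#   keyed = data_file.map(lambda t: (t[0], t))
#   result = keyed.reduceByKey(max_by_occurrence).values()
# Plain Python stands in for the RDD below so the logic is runnable.

def max_by_occurrence(a, b):
    """Return whichever (user, item, occurrences) triplet has more occurrences."""
    return a if a[2] >= b[2] else b

triplets = [
    (u'u1', u's1', 20.0), (u'u1', u's2', 5.0),
    (u'u2', u's3', 5.0), (u'u2', u's2', 10.0),
]

# Reduce each user's triplets down to the one with the max occurrences,
# the same way reduceByKey would combine values pairwise per key.
best = {}
for t in triplets:
    user = t[0]
    best[user] = t if user not in best else max_by_occurrence(best[user], t)

print(sorted(best.values()))
# → [('u1', 's1', 20.0), ('u2', 's2', 10.0)]
```

The pairwise reducer is associative and commutative (ties broken toward the first argument), which is what a `reduceByKey`-style combine requires.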