I am trying to get the output of a reduceByKey call in PySpark to be the range of the integers passed through, per key.
I tried writing a custom function:
    def _range(x, y):
        return [max(x, y), min(x, y)]

    data2 = (data_
             .map(lambda x: (x[u'driverId'] + ',' + x[u'afh'], int(x['timestamp'])))
             .reduceByKey(lambda x, y: _range(x, y)))
Of course, the output comes out as lists nested within lists within lists: the reducer's output type (a list) does not match its input type (an int), so each successive reduction wraps the previous result in yet another list.
I know one solution would be

    .reduceByKey(max)

followed by

    .reduceByKey(min)

and then combining the two results, but I do NOT want to perform two separate operations.
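For context, the two-job version I want to avoid would look roughly like this (pairs is just a hypothetical name for the keyed RDD built by the map above):

    # Two separate reductions, then a join to combine them -- this is
    # exactly the extra pass over the data I am trying to avoid.
    pairs = data_.map(lambda x: (x[u'driverId'] + ',' + x[u'afh'], int(x['timestamp'])))
    maxs = pairs.reduceByKey(max)
    mins = pairs.reduceByKey(min)
    combined = mins.join(maxs)  # (key, (min, max)) per key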
I would like to do this in one pass so the application is not inefficient, and I would also like to avoid first populating a list of all the integers per key. Any ideas? The data is in an RDD. Thanks.
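For reference, a minimal sketch of the kind of single-pass shape I am after (assuming the same key construction as above): seed each value as a (min, max) tuple so the reducer's input and output types match, which lets reduceByKey combine everything in one pass with no nesting.

    # One pass: every value starts life as a (min, max) pair, and the
    # reducer combines two pairs into one pair, so the types stay closed.
    data2 = (data_
             .map(lambda x: (x[u'driverId'] + ',' + x[u'afh'],
                             (int(x['timestamp']), int(x['timestamp']))))
             .reduceByKey(lambda a, b: (min(a[0], b[0]), max(a[1], b[1]))))

    # If the numeric range (max - min) is wanted rather than the pair:
    ranges = data2.mapValues(lambda mm: mm[1] - mm[0])

aggregateByKey would presumably work here as well, but the tuple-based reducer stays closest to the original code and avoids materializing any per-key list of integers.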