I have a list of tuples of the form [(ID, Date), (ID, Date), ...], where each date is a datetime.date object. Here is an example of the RDD I'm working with:
[('1', datetime.date(2012, 1, 1)),
('2', datetime.date(2012, 1, 1)),
('3', datetime.date(2012, 1, 1)),
('4', datetime.date(2012, 1, 1)),
('5', datetime.date(2012, 1, 1)),
('1', datetime.date(2011, 1, 1)),
('2', datetime.date(2013, 1, 1)),
('3', datetime.date(2015, 1, 1)),
('4', datetime.date(2010, 1, 1)),
('5', datetime.date(2018, 1, 1))]
I need to gather the IDs and the minimum date associated with each ID. Presumably, this is a reduceByKey operation, but I've not been able to sort out the associated function. I'm guessing I'm just over-complicating things, but help would be appreciated in identifying the appropriate lambda (or another method, if reduceByKey is not the most efficient choice in this scenario).
I've scoured Stack Overflow and found similar answers here, here, and here, but again, I've not been able to adapt these answers to my particular scenario. Often, the datetime format seems to trip things up (the datetime format itself is a result of how I parsed the XML, so I can go back and have the dates parsed as strings if that's helpful).
I've attempted the following and received an error for each:
.reduceByKey(min)
- IndexError: tuple index out of range
.reduceByKey(lambda x, y: (x, min(y)))
- IndexError: tuple index out of range (when the dates are converted to strings; with datetime.date values I get the error below instead)
.reduceByKey(lambda x, y: (x[0], min(y)))
- TypeError: 'datetime.date' object is not subscriptable
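A plain-Python sketch of what the reducing function receives may help narrow this down. It assumes the RDD's values are bare datetime.date objects, in which case the function passed to reduceByKey is called with two values for the same key, not with a (key, value) pair. The names `a`, `b`, `attempts`, and `errors` are made up for illustration, and no Spark is involved here.

```python
import datetime

# Two values that share the key '1' in the sample data.
a = datetime.date(2012, 1, 1)
b = datetime.date(2011, 1, 1)

# The two lambda attempts, called the way a per-key reduction would call them:
# with two values (dates), not with a key and a value.
attempts = {
    "(x, min(y))": lambda x, y: (x, min(y)),
    "(x[0], min(y))": lambda x, y: (x[0], min(y)),
}

errors = {}
for label, func in attempts.items():
    try:
        func(a, b)
    except TypeError as exc:
        # A single date can be neither indexed nor iterated.
        errors[label] = str(exc)

# Comparing the two dates directly works, since datetime.date is orderable.
earliest = min(a, b)
print(errors)
print(earliest)
```

If this matches the real data, both lambdas fail because they treat a single date as a sequence, while a two-argument `min` on the dates themselves does not.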
I expect the final result to be as follows:
[('1', datetime.date(2011, 1, 1)),
('2', datetime.date(2012, 1, 1)),
('3', datetime.date(2012, 1, 1)),
('4', datetime.date(2010, 1, 1)),
('5', datetime.date(2012, 1, 1))]
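To pin down the reduction being asked for, here is a minimal sketch in plain Python, without Spark. `reduce_by_key` is a hypothetical local helper that folds the values sharing a key with a two-argument function; it only imitates the per-key semantics and is not part of any library. On an RDD shaped exactly like the sample above, the corresponding call would presumably be `rdd.reduceByKey(min)`, though that assumes every record really is a two-element (ID, date) tuple.

```python
import datetime

pairs = [
    ("1", datetime.date(2012, 1, 1)),
    ("2", datetime.date(2012, 1, 1)),
    ("3", datetime.date(2012, 1, 1)),
    ("4", datetime.date(2012, 1, 1)),
    ("5", datetime.date(2012, 1, 1)),
    ("1", datetime.date(2011, 1, 1)),
    ("2", datetime.date(2013, 1, 1)),
    ("3", datetime.date(2015, 1, 1)),
    ("4", datetime.date(2010, 1, 1)),
    ("5", datetime.date(2018, 1, 1)),
]


def reduce_by_key(func, records):
    """Hypothetical stand-in for a per-key reduction over (key, value) pairs."""
    merged = {}
    for key, value in records:
        # func only ever sees two values that belong to the same key.
        merged[key] = func(merged[key], value) if key in merged else value
    return sorted(merged.items())


earliest = reduce_by_key(min, pairs)
for record in earliest:
    print(record)
```

The built-in `min` is enough as the reducing function here because datetime.date objects compare chronologically, so no conversion to strings should be needed.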