
I have an RDD of tuples in the format [(ID, Date), (ID, Date), ...], with dates in datetime.date format. As an example of the RDD I'm working with:

[('1', datetime.date(2012, 1, 1)),
 ('2', datetime.date(2012, 1, 1)),
 ('3', datetime.date(2012, 1, 1)),
 ('4', datetime.date(2012, 1, 1)),
 ('5', datetime.date(2012, 1, 1)),
 ('1', datetime.date(2011, 1, 1)),
 ('2', datetime.date(2013, 1, 1)),
 ('3', datetime.date(2015, 1, 1)),
 ('4', datetime.date(2010, 1, 1)),
 ('5', datetime.date(2018, 1, 1))]

I need to gather the IDs and the minimum date associated with each ID. Presumably, this is a reduceByKey action, but I've not been able to sort out the associated function. I'm guessing I'm just over-complicating things, but help would be appreciated in identifying the appropriate lambda (or method if reduceByKey is not most efficient in this scenario).

I've scoured StackOverflow and found similar answers here, here, and here, but again, I've not been able to successfully modify those answers to fit my particular scenario. Oftentimes, the datetime format seems to trip things up (the datetime format itself is due to the way I parsed the XML, so I can go back and have it parsed as a string if that's helpful).

I've attempted the following, and receive errors for each:

`.reduceByKey(min)` - IndexError: tuple index out of range

`.reduceByKey(lambda x, y: (x, min(y)))` - IndexError: tuple index out of range (if datetime is converted to a string; error below if left in datetime format)

`.reduceByKey(lambda x, y: (x[0], min(y)))` - TypeError: 'datetime.date' object is not subscriptable

I expect the final result to be as follows:

[('1', datetime.date(2011, 1, 1)),
 ('2', datetime.date(2012, 1, 1)),
 ('3', datetime.date(2012, 1, 1)),
 ('4', datetime.date(2010, 1, 1)),
 ('5', datetime.date(2012, 1, 1))]
  • try `rdd.reduceByKey(lambda x,y: x if x – jxc Nov 12 '19 at 21:25
  • @jxc - `IndexError: tuple index out of range` when I try to `.take()`. – alofgran Nov 12 '19 at 21:28
  • your RDD elements are (id, date) key-value pairs or not? – jxc Nov 12 '19 at 21:32
  • @jxc - I understand them to be key, value pairs. What exactly designates them as such in Spark though? I thought the key was always the first element and the value the 2nd in a 2-element (1-tuple/row) RDD... – alofgran Nov 12 '19 at 21:35
  • As indicated here: https://stackoverflow.com/questions/35703298/how-to-determine-if-object-is-a-valid-key-value-pair-in-pyspark. Correct me if I'm misunderstanding though. – alofgran Nov 12 '19 at 21:47
  • 1
    you are right, in a Pair RDD, the first item of the tuple is always the key and 2nd as value. in reduceByKey(lambda x, y:), both x and y are values, so you should not use x[0] or x[1] etc. both x and y are datetime.date objects. – jxc Nov 12 '19 at 21:51
  • type is listed as `PipelinedRDD` though for the whole object. – alofgran Nov 12 '19 at 21:54
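To illustrate jxc's point above, the behavior of reduceByKey on a Pair RDD can be sketched in plain Python (no Spark required): the lambda only ever sees the values, never the (key, value) tuples. The `reduce_by_key` helper below is a hypothetical stand-in for the Spark operation, not its actual implementation.

```python
from collections import defaultdict
from datetime import date
from functools import reduce

def reduce_by_key(pairs, func):
    """Hypothetical stand-in for RDD.reduceByKey: group by key,
    then fold the *values* of each group with func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # func receives two values (here, date objects), never tuples,
    # which is why x[0] or x[1] inside the lambda fails
    return [(key, reduce(func, values)) for key, values in grouped.items()]

pairs = [('1', date(2012, 1, 1)), ('2', date(2012, 1, 1)),
         ('1', date(2011, 1, 1)), ('2', date(2013, 1, 1))]

result = reduce_by_key(pairs, lambda x, y: x if x < y else y)
# the same lambda is equivalent to passing min directly
```

Because `min` already takes two comparable values, `reduce_by_key(pairs, min)` produces the same result, which mirrors `rdd.reduceByKey(min)` in PySpark.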

1 Answer


I figured it out; there were several problems. Here's the applicable syntax. First (after creating a SparkSession, of course), I converted the RDD to a DataFrame with:

df = spark.createDataFrame(rdd, ['col1', 'col2'])

I then carried out a groupBy and an aggregation. You'll see these on other SO answers, but I thought I'd post this here since it's in the context of my particular scenario.

from pyspark.sql import functions as F
df = df.groupBy('col1').agg(F.min('col2'))

To return the data to RDD format, I used:

result = df.rdd.map(lambda x: (x[0], x[1]))

In this particular case, that maps the ID (column 0) and the aggregated minimum date (column 1) of the DataFrame back into a tuple assigned to result.
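As a quick sanity check outside Spark, the same group-and-take-minimum logic can be verified in plain Python against the sample data from the question (this is just a cross-check of the expected result, not the Spark code itself):

```python
from datetime import date

rows = [('1', date(2012, 1, 1)), ('2', date(2012, 1, 1)),
        ('3', date(2012, 1, 1)), ('4', date(2012, 1, 1)),
        ('5', date(2012, 1, 1)), ('1', date(2011, 1, 1)),
        ('2', date(2013, 1, 1)), ('3', date(2015, 1, 1)),
        ('4', date(2010, 1, 1)), ('5', date(2018, 1, 1))]

# Keep the smallest date seen so far for each ID
earliest = {}
for id_, d in rows:
    if id_ not in earliest or d < earliest[id_]:
        earliest[id_] = d

result = sorted(earliest.items())
```

The sorted output matches the expected final result listed in the question.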

In this process, I also discovered some interesting bits that may be helpful to others:

  1. I had some null values in my DataFrame that I wasn't aware of, which continually caused errors, mostly "'NoneType' object is not subscriptable". While this made sense, it took me a while to figure out where the NoneType values were located.
  2. Some of my XML had been incorrectly parsed, so it was returning a tuple of (None) instead of the (None, None) tuple required by the format of the data above.

These corrections allowed me to .show() the DataFrames (and not just .printSchema()). The .groupBy and its associated objects were never a problem.

  • @jxc's answer in the comments may have worked after I resolved the 'None' issue described in this answer, but I didn't get around to trying it out. – alofgran Nov 14 '19 at 21:23