I've written a program in Python and pandas which takes a very large dataset (~4 million rows per month for 6 months), groups it by 2 of the columns (date and a label), and then applies a function to each group of rows. There are a variable number of rows in each grouping - anywhere from a handful of rows to thousands of rows. There are thousands of groups per month (label-date combos).
My current program uses multiprocessing, so it's pretty efficient, and I thought would map well to Spark. I've worked with map-reduce before, but am having trouble implementing this in Spark. I'm sure I'm missing some concept in the pipelining, but everything I've read appears to focus on key-value processing, or splitting a distributed dataset by arbitrary partitions, rather than what I'm trying to do. Is there a simple example or paradigm for doing this? Any help would be greatly appreciated.
EDIT: Here's some pseudo-code for what I'm currently doing:
reader = pd.read_csv()
pool = mp.Pool(processes=4)
labels = <list of unique labels>
for label in labels:
dates = reader[(reader.label == label)]
for date in dates:
df = reader[(reader.label==label) && (reader.date==date)]
pool.apply_async(process, df, callback=callbackFunc)
pool.close()
pool.join()
When I say asynchronous, I mean something analogous to pool.apply_async().