I've got a DataFrame I'm operating on, and I want to group by a set of columns and operate per-group on the rest of the columns. In regular RDD-land I think it would look something like this:
rdd.map(tup => ((tup._1, tup._2, tup._3), tup))
   .groupByKey() // partitions now hold (key, Iterable[tup]) pairs
   .foreachPartition(iter => doSomeJob(iter))
In DataFrame-land I'd start like this:
df.groupBy("col1", "col2", "col3") // Reference by name
but then I'm not sure how to operate on the groups if my operations are more complicated than the mean/min/max/count offered by GroupedData.
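To be concrete about the limitation, the built-in path only covers simple per-group statistics, something like this (col4 here is a hypothetical numeric column, just for illustration):

```scala
import org.apache.spark.sql.functions._

// Built-in aggregations are fine for simple per-group stats...
df.groupBy("col1", "col2", "col3")
  .agg(mean("col4"), max("col4"), count("col4"))
// ...but there's no obvious hook for running arbitrary
// per-group logic over the full Rows of each group.
```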
For example, I want to build a single MongoDB document per ("col1", "col2", "col3") group (by iterating through the associated Rows in the group), scale down to N partitions, then insert the docs into a MongoDB database. The N limit is the max number of simultaneous connections I want.
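To make the intended flow explicit, here's a rough sketch of what I have in mind by dropping back to the RDD, assuming the grouping columns are strings; buildDocument and insertDocuments are hypothetical placeholders for the Mongo logic, and N is the connection cap mentioned above:

```scala
// Sketch only: group by the key columns, build one document per
// group, then coalesce so at most N partitions (connections) hit Mongo.
val docs = df.rdd
  .map(row => ((row.getAs[String]("col1"),
                row.getAs[String]("col2"),
                row.getAs[String]("col3")), row))
  .groupByKey()                                     // one entry per (col1, col2, col3)
  .map { case (key, rows) => buildDocument(key, rows) } // hypothetical helper
  .coalesce(N)                                      // cap simultaneous connections at N

docs.foreachPartition(partition => insertDocuments(partition)) // hypothetical helper
```

I'm not sure whether round-tripping through df.rdd like this is the idiomatic way, or whether there's a DataFrame-native approach I'm missing.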
Any advice?