I am beginner to learning Apache Spark. Currently I am trying to learn various aggregations using Python.
To give to some context to the problem I am facing, I am finding it difficult to understand the working of aggregateByKey function to calculate number of orders by "status".
I am following a YouTube playlist by ITVersity and below is the code and some sample output I am working with.
ordersRDD = sc.textFile("/user/cloudera/sqoop_import/orders")
for i in ordersRDD.take(10): print(i)
Output:
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE
6,2013-07-25 00:00:00.0,7130,COMPLETE
7,2013-07-25 00:00:00.0,4530,COMPLETE
8,2013-07-25 00:00:00.0,2911,PROCESSING
9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT
10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT
ordersMap = ordersRDD.map(lambda x: (x.split(",")[3], x))
Output:
(u'CLOSED', u'1,2013-07-25 00:00:00.0,11599,CLOSED')
(u'PENDING_PAYMENT', u'2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT')
(u'COMPLETE', u'3,2013-07-25 00:00:00.0,12111,COMPLETE')
(u'CLOSED', u'4,2013-07-25 00:00:00.0,8827,CLOSED')
(u'COMPLETE', u'5,2013-07-25 00:00:00.0,11318,COMPLETE')
(u'COMPLETE', u'6,2013-07-25 00:00:00.0,7130,COMPLETE')
(u'COMPLETE', u'7,2013-07-25 00:00:00.0,4530,COMPLETE')
(u'PROCESSING', u'8,2013-07-25 00:00:00.0,2911,PROCESSING')
(u'PENDING_PAYMENT', u'9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT')
(u'PENDING_PAYMENT', u'10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT')
ordersByStatus = ordersMap.aggregateByKey(0, lambda acc, val: acc + 1, lambda acc,val: acc + val)
for i in ordersByStatus.take(10): print(i)
Final Output:
(u'SUSPECTED_FRAUD', 1558)
(u'CANCELED', 1428)
(u'COMPLETE', 22899)
(u'PENDING_PAYMENT', 15030)
(u'PENDING', 7610)
(u'CLOSED', 7556)
(u'ON_HOLD', 3798)
(u'PROCESSING', 8275)
(u'PAYMENT_REVIEW', 729)
The questions I have difficulty understanding are:
1. Why does the aggregateByKey function taken in 2 lambda functions as parameters?
2. Vizualize what the first lambda function does?
3. Vizualize what the second lambda function does?
It would be very helpful if you could explain me the above questions and also workings of aggregateByKey with some simple block diagram if possible? Perhaps a few intermediate calculation?
Appreciate your help!
Thanks,
Shiv