In theory, I think I understand the way that aggregate
works, but I can't get past a very simple example.
Notably, the example here seems to give the wrong result. This is what I run on my machine:
seqOp = (lambda x, y: (x[0] + y, x[1] + 1))           # fold element y into the (sum, count) accumulator x
combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1]))    # merge two (sum, count) accumulators
ag = sc.parallelize([1, 2, 3, 4]).aggregate((1, 0), seqOp, combOp)  # sc is the pyspark shell's SparkContext
The result I get is:
>>> ag
(12, 4)
But the link I cited says the result is (19, 4). That author is using a different version of Spark (1.2.0), while I'm using 1.5.2.
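In case it's relevant, here is how I'm reading my version from the shell (sc.version is a standard SparkContext attribute; 1.5.2 is just what my setup reports):

>>> sc.version
u'1.5.2'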
Did the aggregate function change between the versions of Spark?
If the answer is no, then it is still baffling how 12 ends up as the first element of that tuple. Examining just the first element, we can see that y is added to the first element of the accumulator tuple for every element in the RDD. So, starting with (1, 0), and with y being 1, 2, 3, 4 respectively, this should result in a series of tuples like (2, 1), (3, 1), (4, 1), (5, 1). Now, when I add up the first elements of that series, I get 14.
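To make my mental model concrete, here is a plain-Python sketch (no Spark involved; it just encodes my assumption that seqOp is applied to the zero value once per element, which may well be where I'm going wrong):

# My mental model: apply seqOp to the zero value separately for each
# element, then sum the first components of the resulting tuples.
zero = (1, 0)
seqOp = lambda x, y: (x[0] + y, x[1] + 1)
tuples = [seqOp(zero, y) for y in [1, 2, 3, 4]]
print(tuples)                      # [(2, 1), (3, 1), (4, 1), (5, 1)]
print(sum(t[0] for t in tuples))   # 14, not the 12 Spark returns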
Is there something obvious I'm missing about how to get 12? Thanks much.