
I have the following pair RDD:

    val pairRDD = prFilterRDD.map(x => (x(1), x(4).toFloat))

where x(1) is the category id and x(4).toFloat is the price.
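
For illustration, prFilterRDD is assumed here to hold records that are already split into string fields, so each pair ends up with a String key and a Float value. The record layout and values below are hypothetical; only positions 1 and 4 matter:

    // hypothetical pre-split record: Array(productId, categoryId, name, description, price)
    val record = Array("685", "2", "Some product", "", "124.99")
    val pair = (record(1), record(4).toFloat)   // ("2", 124.99)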

Result:

    (2,124.99)
    (17,129.99)
    (2,129.99)
    (17,199.99)
    (17,299.99)
    (17,299.99)
    (2,139.99)
    (17,149.99)
    (17,119.99)
    (17,399.99)
    (3,189.99)
    (17,119.99)
    (3,159.99)
    (18,129.99)
    (18,189.99)
    (3,199.99)
    (18,134.99)
    (18,129.99)
    (18,129.99)
    (18,139.99)
    (3,149.99)
    (18,129.99)
    (3,159.99)
    (18,124.99)
    (4,299.98)
    (18,129.99)

I would like to calculate the sum of the prices by category id, so I wrote the following code:

    val initialVal = 0.0f

    // seqOp: fold each price into the running total within a partition
    val comb = (accum: Float, strVal: Float) => accum + strVal

    // combOp: merge the per-partition totals
    val mergeValSum = (v1: Float, v2: Float) => v1 + v2

    val output = pairRDD.aggregateByKey(initialVal)(comb, mergeValSum)

I get the following result:

    (4,5689.7803)
    (8,1184.95)
    (19,1799.87)
    (48,6599.831)
    (51,1499.93)
    (22,3114.95)
    (33,587.97003)
    (44,1744.8999)
    (11,5619.8115)
    (49,2789.89)
    (5,2314.89)

I don't get the expected result. For example, for category id = 8 the expected result is 792.0 but I get 1184.95. Am I using aggregateByKey correctly?
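
As a cross-check, the same per-key sum can also be expressed with reduceByKey, which only takes a single merge function (a minimal sketch against the same pairRDD):

    // reduceByKey adds up the values for each key;
    // for a plain sum it should return the same totals as the aggregateByKey call above
    val sumByKey = pairRDD.reduceByKey(_ + _)
    sumByKey.collect().foreach(println)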

Thank you for your answer.

fleur
  • In your sample of data there is no row with category-id 8. Can you provide a real sample of the input data, not your result? Right now your result doesn't match the input. – Boris Azanov Oct 25 '20 at 19:29
  • Thank you for your answer, it helps me. The input is: (8,59.98) (8,28.0) (8,21.99) (8,29.99) (8,24.97) (8,99.98) (8,69.99) (8,34.99) (8,45.0) (8,29.99) (8,39.99) (8,31.97) (8,44.99) (8,69.99) (8,45.0) (8,21.99) (8,34.99) (8,25.99) (8,32.0) My solution is correct; there is no error. The expected result was wrong. – fleur Oct 26 '20 at 14:01
  • @BorisAzanov, is it possible to calculate the average price by category id with aggregateByKey? – fleur Oct 26 '20 at 14:07
  • This is how you correctly calculate averages using aggregateByKey: https://stackoverflow.com/questions/29930110/calculating-the-averages-for-each-key-in-a-pairwise-k-v-rdd-in-spark-with-pyth (see the sketch after these comments). My question is: why aren't you using groupBy to do this operation? – user238607 Oct 26 '20 at 14:46
  • @user238607 Because the performance is better with aggregateByKey than with groupByKey. Do you agree? – fleur Oct 26 '20 at 16:02
  • @fleur: If you were to use DataFrames and groupBy operations with an average aggregation, Spark's Catalyst optimizer would optimize the query for you. Do you think you can do a better job than that? You can look at the physical plan for both approaches and see whether they differ in other ways. – user238607 Oct 29 '20 at 18:44
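
Following up on the average question in the comments, here is a minimal sketch of the (sum, count) accumulator pattern described in the linked answer, assuming the same pairRDD as above (the names sumCount and avgByKey are illustrative):

    // accumulator is (running sum, count); the zero value is (0.0f, 0)
    val sumCount = pairRDD.aggregateByKey((0.0f, 0))(
      (acc, price) => (acc._1 + price, acc._2 + 1),   // seqOp: fold one price into the accumulator
      (a, b) => (a._1 + b._1, a._2 + b._2)            // combOp: merge per-partition accumulators
    )
    val avgByKey = sumCount.mapValues { case (sum, count) => sum / count }
    avgByKey.collect().foreach(println)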
