
I am learning PySpark. I have been trying to get the average weight by 'sex' (male ('M'), female ('F')) using the reduceByKey() transformation on a key/value RDD.

The code I am using is:

'''

import numpy as np

def get_mean(*args):
    return np.sum(args) / len(args)

mean_weight = sc.textFile('rio_2016.csv')\
    .map(lambda x: x.split(','))\
    .filter(lambda x: not x[0].startswith('*'))\
    .map(lambda x: (x[3], float(x[6])))\
    .reduceByKey(get_mean)

'''

The wrong values I am getting from this code are:

[('M', 70.53506980749627), ('F', 67.99280032604982)]

The correct values I got using pandas are:

F    64.821096
M    82.411652

The female/male counts and the overall average (males and females combined) match between pandas and PySpark. The only thing I can't get right is the average by 'sex'.
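For reference, the pandas check was along these lines (a minimal sketch; the 'sex' and 'weight' column names stand in for whatever the CSV header actually calls those columns):

'''

import pandas as pd

df = pd.read_csv('rio_2016.csv')
print(df.groupby('sex')['weight'].mean())

'''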

1 Answer

The original code returns wrong values because reduceByKey() applies its function to two values at a time and requires that function to be associative and commutative. get_mean() is neither: partial averages get averaged again, and mean(mean(a, b), c) is not mean(a, b, c) in general.

We can solve this kind of problem using pair-RDD methods such as reduceByKey, groupByKey, and aggregateByKey.

Here is the solution using one of the recommended methods for computing an average on a pair RDD, reduceByKey: carry a (sum, count) pair per key and divide at the end. Try the other methods as well (an aggregateByKey variant is sketched after the solution below); you can read about the differences among those methods here.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()

# Sample (sex, weight) pairs
in_values = [('M', 80.5), ('M', 90), ('F', 70), ('M', 75.5), ('F', 85), ('M', 60)]

rdd = spark.sparkContext.parallelize(in_values)

# Pair each weight with a count of 1, sum the (sum, count) pairs per key
# (tuple-wise addition is associative and commutative), then divide.
avg_rdd = rdd.mapValues(lambda v: (v, 1)) \
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])) \
    .mapValues(lambda v: v[0] / v[1]) \
    .collect()

print(avg_rdd)

# [('M', 76.5), ('F', 77.5)]
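The same (sum, count) pattern drops straight into the question's own pipeline. A minimal sketch, assuming the question's sc and its column layout (index 3 = sex, index 6 = weight) for rio_2016.csv:

mean_weight = sc.textFile('rio_2016.csv') \
    .map(lambda x: x.split(',')) \
    .filter(lambda x: not x[0].startswith('*')) \
    .map(lambda x: (x[3], (float(x[6]), 1))) \
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
    .mapValues(lambda v: v[0] / v[1])

print(mean_weight.collect())

For comparison, here is a sketch of the same idea with aggregateByKey on the sample RDD above; the zero value (0.0, 0) is the initial (sum, count) accumulator per key:

seq_op = lambda acc, v: (acc[0] + v, acc[1] + 1)    # fold one weight into (sum, count)
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge two (sum, count) accumulators

avg_agg = rdd.aggregateByKey((0.0, 0), seq_op, comb_op) \
    .mapValues(lambda v: v[0] / v[1]) \
    .collect()

print(avg_agg)  # same averages as above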
Mohana B C