I am learning PySpark and have been trying to compute the average weight by 'sex' (male 'M', female 'F') using the reduceByKey() transformation on a key/value RDD.
The code I am using is:
```
import numpy as np

def get_mean(*args):
    return np.sum(args) / len(args)

mean_weight = sc.textFile('rio_2016.csv')\
    .map(lambda x: x.split(','))\
    .filter(lambda x: not x[0].startswith('*'))\
    .map(lambda x: (x[3], float(x[6])))\
    .reduceByKey(get_mean)
```
The wrong values I am getting from this code are:
[('M', 70.53506980749627), ('F', 67.99280032604982)]
The correct values I got using pandas are: F 64.821096 M 82.411652
The female/male counts and the overall average (male and female combined) match between pandas and PySpark. The only thing I can't get right is the average by 'sex'.
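
A likely explanation, sketched in plain Python so it runs without a Spark cluster: reduceByKey() calls its function with only two values at a time and feeds the result back in, so get_mean ends up averaging a running average with the next value rather than averaging all values at once. The sample weights below are hypothetical (not taken from rio_2016.csv), and `pairwise_mean` and `add_pairs` are illustrative names. One common workaround, assumed here rather than confirmed for this dataset, is to reduce (sum, count) pairs and divide at the end.

```python
from functools import reduce

# Hypothetical weights for a single key; not real data from rio_2016.csv.
weights = [60.0, 70.0, 80.0, 90.0]

def pairwise_mean(a, b):
    # What get_mean effectively computes inside reduceByKey:
    # it only ever receives two arguments, so it returns (a + b) / 2.
    return (a + b) / 2

def add_pairs(a, b):
    # Combine two (sum, count) accumulators. This operation is
    # associative and commutative, which reduceByKey requires.
    return (a[0] + b[0], a[1] + b[1])

# Folding with a pairwise mean gives later values more weight.
folded = reduce(pairwise_mean, weights)

# Folding (sum, count) pairs and dividing once gives the true mean.
total, count = reduce(add_pairs, [(w, 1) for w in weights])
true_mean = total / count

print(folded, true_mean)

# Assumed PySpark equivalent, applied to the (sex, weight) pairs:
#   .mapValues(lambda w: (w, 1))
#   .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
#   .mapValues(lambda s: s[0] / s[1])
```

Because the pairwise fold depends on the order and grouping of the values, its result can also vary with how Spark partitions the data, which would be consistent with the totals and counts matching while the per-key averages do not.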