I have data like below. Filename:babynames.csv.
year name percent sex
1880 John 0.081541 boy
1880 William 0.080511 boy
1880 James 0.050057 boy
I need to sort the input based on year and sex and I want the output aggregated like below (this output is to be assigned to a new RDD).
year sex avg(percentage) count(rows)
1880 boy 0.070703 3
I am not sure how to proceed after the following step in pyspark. Need your help on this
testrdd = sc.textFile("babynames.csv");
rows = testrdd.map(lambda y:y.split(',')).filter(lambda x:"year" not in x[0])
aggregatedoutput = ????