
I have taken a look at this question: How to use Pyspark to calculate average on RDD, but it did not help.

My data is in a text file, formatted as follows:

robert 43
daniel 64
andrew 99
jake 56
peter 67
sophia 56
marie 62
--
robert 55
daniel 89
andrew 0
jake 11
peter 0
sophia 67
marie 93

I want to create an RDD, calculate the average marks for each student, and then store the result in a DataFrame. How do I do it?

The result I want:

FirstName    AvgMarks
robert         22
daniel         20
andrew         50
jake           10
...
Karthik Bhandary

1 Answer

If you want to use RDDs, you can split the input strings into the name (as key) and the mark (as value) and then follow this approach to calculate the average:

rdd = spark.sparkContext.textFile("textfile")

def splitLine(l):
    # "robert 43" -> ("robert", 43); separator lines like "--" -> (l, None)
    parts = l.split(' ')
    if len(parts) == 2:
        return (parts[0], int(parts[1]))
    else:
        return (l, None)

rdd2 = (rdd.map(splitLine)
    .filter(lambda x: x[0] != '--')                        # drop the "--" separator lines
    .mapValues(lambda v: (v, 1))                           # pair each mark with a count of 1
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))  # (sum, count) per name
    .mapValues(lambda t: t[0] / t[1]))                     # average = sum / count

rdd2.foreach(lambda x: print(x))  # prints on the workers; visible when running in local mode

Output:

('daniel', 76.5)
('peter', 33.5)
('marie', 77.5)
('robert', 49.0)
('andrew', 49.5)
('jake', 33.5)
('sophia', 61.5)

rdd2 can then be used to create a DataFrame:

df = spark.createDataFrame(rdd2, ['FirstName', 'AvgMarks'])

But if the goal is to get a DataFrame, there is no need to use RDDs at all:

from pyspark.sql import functions as F

df = spark.read.option('header', False).option('delimiter', ' ') \
    .schema("FirstName STRING, Mark DOUBLE").csv('textfile') \
    .filter(F.col('FirstName') != F.lit('--')) \
    .groupBy('FirstName').avg('Mark')
werner