
I have taken a look at this question: How to use Pyspark to calculate average on RDD, but it did not help.

My data is in a text file, formatted as follows:

robert 43
daniel 64
andrew 99
jake 56
peter 67
sophia 56
marie 62
--
robert 55
daniel 89
andrew 0
jake 11
peter 0
sophia 67
marie 93

I want to create an RDD, calculate the average marks for each student, and then store the result in a DataFrame. How do I do it?

The result I want:

FirstName    AvgMarks
robert         22
daniel         20
andrew         50
jake           10
...
Karthik Bhandary

1 Answer

If you want to use RDDs, you can split the input strings into the name (as key) and the mark (as value) and then follow this approach to calculate the average:

rdd = spark.sparkContext.textFile("textfile")

def splitLine(l):
    # "robert 43" -> ("robert", 43); separator lines like "--" -> (l, None)
    parts = l.split(' ')
    if len(parts) == 2:
        return (parts[0], int(parts[1]))
    else:
        return (l, None)

rdd2 = (rdd.map(splitLine)
    .filter(lambda x: x[0] != '--')                        # drop the "--" separator lines
    .mapValues(lambda v: (v, 1))                           # pair each mark with a count of 1
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))  # (sum, count) per name
    .mapValues(lambda t: t[0] / t[1]))                     # average = sum / count

rdd2.foreach(lambda x: print(x))  # prints on the workers; visible when running in local mode

Output:

('daniel', 76.5)
('peter', 33.5)
('marie', 77.5)
('robert', 49.0)
('andrew', 49.5)
('jake', 33.5)
('sophia', 61.5)

rdd2 can then be used to create a DataFrame:

df = spark.createDataFrame(rdd2, ['FirstName', 'AvgMarks'])

But if the goal is to get a DataFrame, there is no need to use RDDs at all:

from pyspark.sql import functions as F

df = spark.read.option('header', False).option('delimiter', ' ') \
    .schema("FirstName STRING, Mark DOUBLE").csv('textfile') \
    .filter(F.col('FirstName') != F.lit('--')) \
    .groupBy('FirstName').avg('Mark')
werner