
I have the PySpark DataFrame below:

cust | amount
-------------
A    | 5
A    | 1
A    | 3
B    | 4
B    | 4
B    | 2
C    | 2
C    | 1
C    | 7
C    | 5

I need to group by column 'cust' and calculate the average per group.

Expected result:

cust |  avg_amount
-------------------
A    |  3
B    |  3.333
C    |  3.75

I've been using the code below, but it gives me an error.

data.withColumn("avg_amount", F.avg("amount"))

Any idea how I can make this average?
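For reference, the `withColumn` call above fails because `F.avg` is an aggregate function, so it needs either a `groupBy` (as in the answer below) or a window specification to aggregate over. A minimal window-based sketch, recreating the example data from the question (names assumed for illustration):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    # Recreate the example DataFrame from the question
    data = spark.createDataFrame(
        [("A", 5), ("A", 1), ("A", 3),
         ("B", 4), ("B", 4), ("B", 2),
         ("C", 2), ("C", 1), ("C", 7), ("C", 5)],
        ["cust", "amount"],
    )

    # A window partitioned by customer keeps every original row and
    # attaches the per-customer average as a new column.
    w = Window.partitionBy("cust")
    data.withColumn("avg_amount", F.avg("amount").over(w)).show()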

  • how is the amount column determined? it seems to be randomly picked for each cust. – mck Feb 01 '21 at 11:10
  • What is the rule for the column `amount` in the expected output? Why it's `3` for group `A` for example? – blackbishop Feb 01 '21 at 11:11
  • there's no rule in column ```amount```. It is transactional row. Depends on how many transaction the customer is doing. – mojek Feb 01 '21 at 11:13
  • So shouldn't it be 2 for B and 4 for C? They have respectively 2 and 4 transactions in the input table. – blackbishop Feb 01 '21 at 11:17
  • yes, let me edit my question – mojek Feb 01 '21 at 11:18
  • 1
    Does this answer your question? [Multiple Aggregate operations on the same column of a spark dataframe](https://stackoverflow.com/questions/34954771/multiple-aggregate-operations-on-the-same-column-of-a-spark-dataframe) – Mykola Zotko Feb 01 '21 at 11:45

1 Answer


Use `groupBy` with `agg` to count the number of transactions and compute the average `amount` per customer:

from pyspark.sql import functions as F

data = data.groupBy("cust")\
           .agg(
               F.count("*").alias("amount"),        # number of transactions per customer
               F.avg("amount").alias("avg_amount")  # average transaction amount per customer
           )
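
Assuming `data` is the example DataFrame from the question, `data.show()` on the result would look roughly like this (row order may differ; note the count column reuses the name `amount`):

    data.show()
    # +----+------+------------------+
    # |cust|amount|        avg_amount|
    # +----+------+------------------+
    # |   A|     3|               3.0|
    # |   B|     3|3.3333333333333335|
    # |   C|     4|              3.75|
    # +----+------+------------------+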