
I have the PySpark DataFrame below:

cust | amount
-------------
A    | 5
A    | 1
A    | 3
B    | 4
B    | 4
B    | 2
C    | 2
C    | 1
C    | 7
C    | 5

I need to group by column 'cust' and calculate the average per group.

Expected result:

cust |  avg_amount
-------------------
A    |  3
B    |  3.333
C    |  3.75

I've been using the code below, but it gives me an error.

data.withColumn("avg_amount", F.avg("amount"))

Any idea how I can make this average?
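For reference, the `withColumn` call above fails because `F.avg` is an aggregate function, so it needs either a `groupBy` (as in the answer below) or a window specification to aggregate over. A minimal window-based sketch, recreating the example data from the question (names assumed for illustration):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    # Recreate the example DataFrame from the question
    data = spark.createDataFrame(
        [("A", 5), ("A", 1), ("A", 3),
         ("B", 4), ("B", 4), ("B", 2),
         ("C", 2), ("C", 1), ("C", 7), ("C", 5)],
        ["cust", "amount"],
    )

    # A window partitioned by customer keeps every original row and
    # attaches the per-customer average as a new column.
    w = Window.partitionBy("cust")
    data.withColumn("avg_amount", F.avg("amount").over(w)).show()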

  • how is the amount column determined? it seems to be randomly picked for each cust. – mck Feb 01 '21 at 11:10
  • What is the rule for the column `amount` in the expected output? Why it's `3` for group `A` for example? – blackbishop Feb 01 '21 at 11:11
  • there's no rule in column ```amount```. It is transactional row. Depends on how many transaction the customer is doing. – mojek Feb 01 '21 at 11:13
  • So shouldn't it be 2 for B and 4 for C? They have respectively 2 and 4 transactions in the input table. – blackbishop Feb 01 '21 at 11:17
  • yes, let me edit my question – mojek Feb 01 '21 at 11:18
  • 1
    Does this answer your question? [Multiple Aggregate operations on the same column of a spark dataframe](https://stackoverflow.com/questions/34954771/multiple-aggregate-operations-on-the-same-column-of-a-spark-dataframe) – Mykola Zotko Feb 01 '21 at 11:45

1 Answer


Use `groupBy` with `agg` to count the number of transactions and compute the average `amount` per customer:

from pyspark.sql import functions as F

data = data.groupBy("cust")\
           .agg(
               F.count("*").alias("amount"),        # number of transactions per customer
               F.avg("amount").alias("avg_amount")  # average transaction amount per customer
           )
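
Assuming `data` is the example DataFrame from the question, `data.show()` on the result would look roughly like this (row order may differ; note the count column reuses the name `amount`):

    data.show()
    # +----+------+------------------+
    # |cust|amount|        avg_amount|
    # +----+------+------------------+
    # |   A|     3|               3.0|
    # |   B|     3|3.3333333333333335|
    # |   C|     4|              3.75|
    # +----+------+------------------+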