For example, here is my test data:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

test = spark.createDataFrame([
    (0, 1, 5, "2018-06-03", "Region A"),
    (1, 1, 2, "2018-06-04", "Region B"),
    (2, 2, 1, "2018-06-03", "Region B"),
    (3, 3, 1, "2018-06-01", "Region A"),
    (3, 1, 3, "2018-06-05", "Region A"),
]).toDF("orderid", "customerid", "price", "transactiondate", "location")
test.show()
I can get summarization data like this:
from pyspark.sql.functions import sum

test.groupBy("customerid", "location").agg(sum("price")).show()
but I also want a percentage column, something like this:
+----------+--------+----------+----------+
|customerid|location|sum(price)|percentage|
+----------+--------+----------+----------+
|         1|Region B|         2|       20%|
|         1|Region A|         8|       80%|
|         3|Region A|         1|      100%|
|         2|Region B|         1|      100%|
+----------+--------+----------+----------+
I want to know
- How can I do it? Maybe by using a window function?
- Can I pivot the table to turn it into something like this (with percentage and sum columns)? I sketch what I have in mind below.
I have only found a pandas example, at How to get percentage of counts of a column after groupby in Pandas.
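For the pivot part, what I have in mind is roughly the following (untested sketch; I am only guessing that each location should become its own column, and it does not yet include the percentage):

from pyspark.sql import functions as F

# one row per customerid, one sum(price) column per location
test.groupBy("customerid").pivot("location").agg(F.sum("price")).show()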
UPDATE:
With the help of @Gordon Linoff, I can get the percentage like this:
from pyspark.sql.functions import col, sum
from pyspark.sql.window import Window

test.groupBy("customerid", "location").agg(sum("price"))\
    .withColumn("percentage", col("sum(price)") / sum("sum(price)").over(Window.partitionBy(test['customerid'])))\
    .show()
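To make the column literally read 20%, 80%, 100% like in the desired table above, I think the ratio just needs to be formatted as a string (untested sketch; sum_price and ratio are just names I made up for readability):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("customerid")
test.groupBy("customerid", "location").agg(F.sum("price").alias("sum_price"))\
    .withColumn("ratio", F.col("sum_price") / F.sum("sum_price").over(w))\
    .withColumn("percentage", F.format_string("%d%%", F.round(F.col("ratio") * 100).cast("int")))\
    .drop("ratio")\
    .show()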