For example, here is my test data:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

test = spark.createDataFrame([
    (0, 1, 5, "2018-06-03", "Region A"),
    (1, 1, 2, "2018-06-04", "Region B"),
    (2, 2, 1, "2018-06-03", "Region B"),
    (3, 3, 1, "2018-06-01", "Region A"),
    (3, 1, 3, "2018-06-05", "Region A"),
]).toDF("orderid", "customerid", "price", "transactiondate", "location")
test.show()
I can get summarization data like this:
from pyspark.sql.functions import sum

test.groupBy("customerid", "location").agg(sum("price")).show()
but I also want a percentage column, something like this:
+----------+--------+----------+----------+
|customerid|location|sum(price)|percentage|
+----------+--------+----------+----------+
|         1|Region B|         2|       20%|
|         1|Region A|         8|       80%|
|         3|Region A|         1|      100%|
|         2|Region B|         1|      100%|
+----------+--------+----------+----------+
I want to know
- How can I do it? Maybe by using a window function?
- Can I pivot the table to turn it into something like this (with percentage and sum columns)? I sketch what I have in mind below.
I have only found a pandas example, at How to get percentage of counts of a column after groupby in Pandas.
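For the pivot part, what I have in mind is roughly the following (untested sketch; I am only guessing that each location should become its own column, and it does not yet include the percentage):

from pyspark.sql import functions as F

# one row per customerid, one sum(price) column per location
test.groupBy("customerid").pivot("location").agg(F.sum("price")).show()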
UPDATE:
With the help of @Gordon Linoff, I can get the percentage like this:
from pyspark.sql.functions import col, sum
from pyspark.sql.window import Window

test.groupBy("customerid", "location").agg(sum("price"))\
    .withColumn("percentage", col("sum(price)") / sum("sum(price)").over(Window.partitionBy(test['customerid'])))\
    .show()
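To make the column literally read 20%, 80%, 100% like in the desired table above, I think the ratio just needs to be formatted as a string (untested sketch; sum_price and ratio are just names I made up for readability):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("customerid")
test.groupBy("customerid", "location").agg(F.sum("price").alias("sum_price"))\
    .withColumn("ratio", F.col("sum_price") / F.sum("sum_price").over(w))\
    .withColumn("percentage", F.format_string("%d%%", F.round(F.col("ratio") * 100).cast("int")))\
    .drop("ratio")\
    .show()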