Finding cumulative sums or means is a very common operation in data analysis, yet all the PySpark solutions I see online tend to bring all the data into one partition, which would not work for really large datasets.
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, sum
# Step 1: Create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Step 2: Create a DataFrame
df = spark.createDataFrame([
    (1, "Alice", 100),
    (2, "Bob", 200),
    (3, "Charlie", 150),
    (4, "David", 300),
    (5, "Eve", 250),
], ["ID", "Name", "Value"])
# Step 3: Sort the DataFrame and compute cumulative sum
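# Note: there is no partitionBy here, so Spark has to pull every row into a single partition to apply this window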
window_spec = Window.orderBy("ID").rangeBetween(Window.unboundedPreceding, 0)
df = df.withColumn("CumulativeSum", sum(col("Value")).over(window_spec))
# Step 4: Show the DataFrame with cumulative sum
df.show()
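The only scalable variation I'm aware of is adding a partitionBy to the window spec, but that only gives a per-group running total rather than a global cumulative sum. A minimal sketch of what I mean, assuming a hypothetical "Group" column that doesn't exist in my real data:

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, sum

spark = SparkSession.builder.getOrCreate()

# Hypothetical data with a "Group" column so the window can be partitioned
df = spark.createDataFrame([
    (1, "Alice", "A", 100),
    (2, "Bob", "A", 200),
    (3, "Charlie", "B", 150),
    (4, "David", "B", 300),
    (5, "Eve", "B", 250),
], ["ID", "Name", "Group", "Value"])

# partitionBy lets each group's running total be computed within its own partition,
# but the result is a cumulative sum per group, not across the whole DataFrame
window_spec = (
    Window.partitionBy("Group")
          .orderBy("ID")
          .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
df = df.withColumn("GroupCumulativeSum", sum(col("Value")).over(window_spec))
df.show()

That avoids the single-partition problem, but it doesn't give me the global running total that my code above computes.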
How are cumulative aggregations supposed to be handled in PySpark without moving all the data into a single partition?