
Computing cumulative sums or means is a very common operation in data analysis, yet every PySpark solution I see online tends to pull all the data into a single partition, which will not work for really large datasets.

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, sum

# Step 1: Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Step 2: Create a DataFrame
df = spark.createDataFrame([
    (1, "Alice", 100),
    (2, "Bob", 200),
    (3, "Charlie", 150),
    (4, "David", 300),
    (5, "Eve", 250)
], ["ID", "Name", "Value"])

# Step 3: Sort the DataFrame and compute cumulative sum
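# Without a partitionBy, Spark warns that it is moving all the data into a
# single partition for this window, which is exactly the bottleneck described above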
window_spec = Window.orderBy("ID").rangeBetween(Window.unboundedPreceding, 0)
df = df.withColumn("CumulativeSum", sum(col("Value")).over(window_spec))

# Step 4: Show the DataFrame with cumulative sum
df.show()
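
One fully distributed workaround I can think of is a two-pass job over the underlying RDD, sketched below: compute one total per partition, collect those (tiny) partial totals to the driver, then add the appropriate offset while doing a running sum inside each partition. This is only a sketch; it assumes the DataFrame has been sorted so that partition order follows key order, and `partition_total` / `with_offsets` are names I made up for illustration.

# Pass 1: one total per partition (a tiny result, safe to collect)
def partition_total(idx, rows):
    total = 0
    for row in rows:
        total += row["Value"]
    yield idx, total

sorted_df = df.select("ID", "Name", "Value").sort("ID")
totals = sorted_df.rdd.mapPartitionsWithIndex(partition_total).collect()

# Offsets: the sum of everything held by the partitions before this one
offsets = {}
running = 0
for idx, total in sorted(totals):
    offsets[idx] = running
    running += total

# Pass 2: running sum inside each partition, shifted by that partition's offset
def with_offsets(idx, rows):
    acc = offsets[idx]
    for row in rows:
        acc += row["Value"]
        yield row["ID"], row["Name"], row["Value"], acc

cumsum_df = (sorted_df.rdd
             .mapPartitionsWithIndex(with_offsets)
             .toDF(["ID", "Name", "Value", "CumulativeSum"]))
cumsum_df.show()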

How are cumulative aggregations meant to be handled in PySpark without moving all the data into a single partition?

figs_and_nuts
  • What do you mean by _tend to bring all the data in one partition_? Spark is a distributed query engine and I am pretty certain that the windowed `sum` would also be calculated in a distributed fashion – botchniaque Jul 15 '23 at 11:20
  • If you try running the code that I have posted above you will see the warnings that I am referring to. If you can find a way to add cumulations in a distributed fashion please post it as an answer. That would be of great help – figs_and_nuts Jul 15 '23 at 12:32
  • Need to do your own via rdds – thebluephantom Jul 15 '23 at 21:05
  • https://stackoverflow.com/questions/35154267/how-to-compute-cumulative-sum-using-spark Try this – thebluephantom Jul 16 '23 at 21:49
  • Does this answer your question? [How to compute cumulative sum using Spark](https://stackoverflow.com/questions/35154267/how-to-compute-cumulative-sum-using-spark) – thebluephantom Jul 17 '23 at 10:17

0 Answers