I have a DataFrame with 752 columns (id, date, and 750 feature columns) and around 1.5 million rows, and I need to compute a cumulative sum over all 750 feature columns, partitioned by id and ordered by date.
Below is the approach I am currently following:
from pyspark.sql import Window
from pyspark.sql.functions import col, sum

# all 750 feature columns in a list
required_columns = ['ts_1', 'ts_2', ..., 'ts_750']

# window definition: partition by id, order by date
sumwindow = Window.partitionBy('id').orderBy('date')

# apply the window to compute the cumulative sum of each feature column
for current_col in required_columns:
    new_col_name = "sum_{0}".format(current_col)
    df = df.withColumn(new_col_name, sum(col(current_col)).over(sumwindow))

# save the result as a Parquet file
df.write.format('parquet').save(output_path)
I am getting the error below when running this approach:
py4j.protocol.Py4JJavaError: An error occurred while calling o2428.save. : java.lang.StackOverflowError
It seems a cumulative sum over this many columns is a bit tricky at this scale. Please suggest an alternate approach, or any Spark configuration I could tune to make this work.
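
One idea I have been considering, though I am not sure it would avoid the StackOverflowError, is to build all the window expressions up front and apply them in a single select instead of chaining 750 withColumn calls (my guess is the chained calls make the plan very deep). A minimal sketch of what I mean, reusing the same df, required_columns, and sumwindow as above (df_out is just a placeholder name):

from pyspark.sql.functions import col, sum

# build one cumulative-sum expression per feature column
sum_exprs = [
    sum(col(c)).over(sumwindow).alias("sum_{0}".format(c))
    for c in required_columns
]

# apply all expressions in a single select rather than 750 withColumn calls
df_out = df.select('id', 'date', *sum_exprs)
df_out.write.format('parquet').save(output_path)

Would this kind of single-select approach help, or is there a better way to structure the job?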