
I have a scenario where I need to find row-wise sums in a DataFrame, as follows:

ID  DEPT  [..]  SUB1  SUB2  SUB3  SUB4  **SUM1**
1   PHY         50    20    30    30    130
2   COY         52    62    63    34    211
3   DOY         53    52    53    84
4   ROY         56    52    53    74
5   SZY         57    62    73    54

I need to find the row-wise sum of SUB1, SUB2, SUB3, and SUB4 for each row and add it as a new column, SUM1. The ordinal position of the column SUB1 in the DataFrame is 16.
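For reference, a minimal sketch that reproduces the sample above, assuming an existing SparkSession; the [..] placeholder columns are omitted, so SUB1 does not actually sit at ordinal position 16 in this toy version:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows from the question; the [..] columns are omitted,
# so SUB1 is at index 2 here rather than at ordinal position 16.
df = spark.createDataFrame(
    [(1, 'PHY', 50, 20, 30, 30),
     (2, 'COY', 52, 62, 63, 34),
     (3, 'DOY', 53, 52, 53, 84),
     (4, 'ROY', 56, 52, 53, 74),
     (5, 'SZY', 57, 62, 73, 54)],
    ['ID', 'DEPT', 'SUB1', 'SUB2', 'SUB3', 'SUB4'],
)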

asked by user1254579

1 Answer


You can use Python's built-in sum to add up the columns:

import pyspark.sql.functions as F

col_list = ['SUB1', 'SUB2', 'SUB3', 'SUB4']
# or slice by position; if SUB1 is the 16th column (1-based),
# that's col_list = df.columns[15:19]

df2 = df.withColumn(
    'SUM1',
    sum([F.col(c) for c in col_list])
)
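This works because PySpark Column objects overload the + operator, so the built-in sum folds the list into a single column expression equivalent to F.col('SUB1') + F.col('SUB2') + F.col('SUB3') + F.col('SUB4').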
answered by mck
  • Thank you. There are 106 columns to be summed. It works well with fewer than 100 columns, but with more than 100 columns it throws the following error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Max iterations (100) reached for batch Resolution, please set 'spark.sql.analyzer.maxIterations' to a larger value., tree: – user1254579 Feb 20 '21 at 15:27
  • Should I try setting spark.sql.optimizer.maxIterations to 100? – user1254579 Feb 20 '21 at 15:29
  • Maybe set it to a larger value, e.g. 200 or 1000, using `spark.sql("set spark.sql.analyzer.maxIterations = 200")`; a sketch follows below. – mck Feb 20 '21 at 15:30
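Following mck's comment, a minimal sketch of the workaround, assuming spark is the active SparkSession. Note that spark.sql.analyzer.maxIterations is an internal Spark SQL setting, so verify it behaves this way on your Spark version; the 106-column slice is hypothetical, since the question does not give the exact column positions:

import pyspark.sql.functions as F

# Raise the analyzer's resolution-pass cap above the default of 100,
# which a '+' chain over ~106 columns can exceed during analysis.
spark.sql("set spark.sql.analyzer.maxIterations = 1000")

# Hypothetical slice covering 106 columns starting at index 15
col_list = df.columns[15:121]

df2 = df.withColumn('SUM1', sum(F.col(c) for c in col_list))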