
I have a scenario where I need to find row-wise sums in a DataFrame, as follows:

ID  DEPT  [..]  SUB1  SUB2  SUB3  SUB4  **SUM1**
1   PHY         50    20    30    30    130
2   COY         52    62    63    34    211
3   DOY         53    52    53    84
4   ROY         56    52    53    74
5   SZY         57    62    73    54

I need to find the row-wise sum of SUB1, SUB2, SUB3, and SUB4 for each row and add it as a new column, SUM1. The ordinal position of the column SUB1 in the DataFrame is 16.
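For reference, a minimal sketch that reproduces the sample above, assuming an existing SparkSession; the [..] placeholder columns are omitted, so SUB1 does not actually sit at ordinal position 16 in this toy version:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows from the question; the [..] columns are omitted,
# so SUB1 is at index 2 here rather than at ordinal position 16.
df = spark.createDataFrame(
    [(1, 'PHY', 50, 20, 30, 30),
     (2, 'COY', 52, 62, 63, 34),
     (3, 'DOY', 53, 52, 53, 84),
     (4, 'ROY', 56, 52, 53, 74),
     (5, 'SZY', 57, 62, 73, 54)],
    ['ID', 'DEPT', 'SUB1', 'SUB2', 'SUB3', 'SUB4'],
)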

asked by user1254579

1 Answer


You can use Python's built-in sum to add up the columns:

import pyspark.sql.functions as F

col_list = ['SUB1', 'SUB2', 'SUB3', 'SUB4']
# or slice by position; if SUB1 is the 16th column (1-based),
# that's col_list = df.columns[15:19]

df2 = df.withColumn(
    'SUM1',
    sum([F.col(c) for c in col_list])
)
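This works because PySpark Column objects overload the + operator, so the built-in sum folds the list into a single column expression equivalent to F.col('SUB1') + F.col('SUB2') + F.col('SUB3') + F.col('SUB4').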
answered by mck
  • Thank you. There are 106 columns to be summed. It works well with fewer than 100 columns, but with more than 100 columns it throws the following error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Max iterations (100) reached for batch Resolution, please set 'spark.sql.analyzer.maxIterations' to a larger value., tree: – user1254579 Feb 20 '21 at 15:27
  • Should I try setting spark.sql.optimizer.maxIterations to 100? – user1254579 Feb 20 '21 at 15:29
  • Maybe set it to a larger value, e.g. 200 or 1000, using `spark.sql("set spark.sql.analyzer.maxIterations = 200")`; a sketch follows below. – mck Feb 20 '21 at 15:30
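Following mck's comment, a minimal sketch of the workaround, assuming spark is the active SparkSession. Note that spark.sql.analyzer.maxIterations is an internal Spark SQL setting, so verify it behaves this way on your Spark version; the 106-column slice is hypothetical, since the question does not give the exact column positions:

import pyspark.sql.functions as F

# Raise the analyzer's resolution-pass cap above the default of 100,
# which a '+' chain over ~106 columns can exceed during analysis.
spark.sql("set spark.sql.analyzer.maxIterations = 1000")

# Hypothetical slice covering 106 columns starting at index 15
col_list = df.columns[15:121]

df2 = df.withColumn('SUM1', sum(F.col(c) for c in col_list))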