
I've got a list of column names I want to sum

columns = ['col1','col2','col3']

How can I add the three and put the result in a new column? (In an automatic way, so that I can change the column list and get new results.)

Dataframe with result I want:

col1   col2   col3   result
 1      2      3       6
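A minimal sketch to build this example DataFrame (the SparkSession setup is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ['col1', 'col2', 'col3'])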
Manrique
  • Possible duplicate of [How do I add a new column to a Spark DataFrame (using PySpark)?](https://stackoverflow.com/questions/33681487/how-do-i-add-a-new-column-to-a-spark-dataframe-using-pyspark) – Prasad Khode Nov 14 '18 at 10:23
  • Thanks for answering! I know how to add columns, I just want an efficient way to add them based on a list of column names. – Manrique Nov 14 '18 at 10:33

3 Answers

31 votes
TL;DR:

You can do this:

from functools import reduce
from operator import add
from pyspark.sql.functions import col

df.na.fill(0).withColumn("result", reduce(add, [col(x) for x in df.columns]))

Explanation:

The df.na.fill(0) portion handles nulls in your data: adding a null to anything yields null, so nulls are replaced with 0 first. If you don't have any nulls, you can skip that and do this instead:

df.withColumn("result", reduce(add, [col(x) for x in df.columns]))
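As a quick end-to-end check, here is a sketch with made-up sample data showing why the fill matters (without it, any null in a row makes that row's sum null):

from functools import reduce
from operator import add
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# One row contains a null to demonstrate the effect of na.fill(0)
df = spark.createDataFrame([(1, 2, 3), (4, None, 6)], ['col1', 'col2', 'col3'])

df.na.fill(0).withColumn("result", reduce(add, [col(x) for x in df.columns])).show()
# The second row's result is 10 (the null counted as 0); without na.fill(0) it would be null.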

If you have a static list of columns, you can do this:

df.withColumn("result", col("col1") + col("col2") + col("col3"))

But if you don't want to type out the whole column list, you need to build the expression col("col1") + col("col2") + col("col3") iteratively. For this, you can use reduce with operator.add:

reduce(add, [col(x) for x in df.columns])

The columns are added two at a time, so you actually get (col("col1") + col("col2")) + col("col3") rather than col("col1") + col("col2") + col("col3") all at once, but the result is the same.

The col(x) ensures that you are adding Column objects rather than concatenating strings: summing the raw names would produce the string 'col1col2col3' instead of a sum.
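To see the fold concretely, here is the same reduce pattern in plain Python (ints stand in for Column objects, since + behaves the same way for this purpose):

from functools import reduce
from operator import add

# reduce(add, [a, b, c]) evaluates as add(add(a, b), c), i.e. (a + b) + c
print(reduce(add, [1, 2, 3]))  # 6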

11 votes

Try this:

df = df.withColumn('result', sum(df[col] for col in df.columns))

df.columns is the list of column names in df, so this sums every column using Python's built-in sum.
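A minimal sketch of this approach with made-up data, plus the pitfall that most likely explains the error in the comment below: a wildcard import shadows Python's built-in sum with PySpark's aggregate sum.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ['col1', 'col2', 'col3'])

# Works: the built-in sum starts at 0 and folds the columns together with +
df.withColumn('result', sum(df[c] for c in df.columns)).show()

# Pitfall: after `from pyspark.sql.functions import *`, the name `sum` refers to
# the aggregate function instead of the built-in, and the line above fails with
# TypeError: 'Column' object is not callable.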

Mayank Porwal
  • I replicated the same with the below dataframe and am getting an error: listA = [(10,20,40,60),(10,10,10,40)]; df = spark.createDataFrame(listA, ['M1','M2','M3','M4']); newdf = df.withColumn('result', sum(df[col] for col in df.columns)). The error: TypeError: 'Column' object is not callable. Am I missing something? – vikrant rana Dec 04 '18 at 14:38
11 votes

Add multiple columns from a list into one column

I tried a lot of methods and the following are my observations:

  1. PySpark's sum function (pyspark.sql.functions.sum) is an aggregate: it sums a single column's values across rows, and doesn't support adding columns together (observed on PySpark 2.3.1).
  2. Python's built-in sum works for some folks but raises an error for others; a likely cause is a wildcard import (from pyspark.sql.functions import *) that shadows the built-in sum with the aggregate version.

So, the addition of multiple columns can be achieved using the expr function in PySpark, which takes an expression to be computed as an input.

from pyspark.sql.functions import expr

cols_list = ['a', 'b', 'c']

# Creating an addition expression using `join`
expression = '+'.join(cols_list)

df = df.withColumn('sum_cols', expr(expression))

This gives us the desired sum of columns. Any other valid SQL expression can be used the same way to compute other outputs.
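For completeness, a runnable sketch of the expr approach (the sample data is illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ['a', 'b', 'c'])

cols_list = ['a', 'b', 'c']
expression = '+'.join(cols_list)  # builds the string "a+b+c"

df.withColumn('sum_cols', expr(expression)).show()
# Any valid SQL expression works here as well, e.g. '(a + b) * c'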

Vivek Payasi