Adding a column of rowsums across a list of columns in Spark Dataframe

Question

I have a Spark dataframe with several columns. I want to add a column on to the dataframe that is a sum of a certain number of the columns.

For example, my data looks like this:

ID var1 var2 var3 var4 var5
a   5     7    9    12   13
b   6     4    3    20   17
c   4     9    4    6    9
d   1     2    6    8    1

I want a column added summing the rows for specific columns:

ID var1 var2 var3 var4 var5   sums
a   5     7    9    12   13    46
b   6     4    3    20   17    50
c   4     9    4    6    9     32
d   1     2    6    8    10    27

I know it is possible to add columns together if you know the specific columns to add:

val newdf = df.withColumn("sumofcolumns", df("var1") + df("var2"))

But is it possible to pass a list of column names and add them together? Based off of this answer which is basically what I want but it is using the python API instead of scala (Add column sum as new column in PySpark dataframe) I think something like this would work:

//Select columns to sum
val columnstosum = ("var1", "var2","var3","var4","var5")

// Create new column called sumofcolumns which is sum of all columns listed in columnstosum
val newdf = df.withColumn("sumofcolumns", df.select(columstosum.head, columnstosum.tail: _*).sum)

This throws the error value sum is not a member of org.apache.spark.sql.DataFrame. Is there a way to sum across columns?

Thanks in advance for your help.

score 47 · Accepted Answer · edited Feb 04 '19 at 18:10

You should try the following:

import org.apache.spark.sql.functions._

val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)

import sqlContext.implicits._

val input = sc.parallelize(Seq(
  ("a", 5, 7, 9, 12, 13),
  ("b", 6, 4, 3, 20, 17),
  ("c", 4, 9, 4, 6 , 9),
  ("d", 1, 2, 6, 8 , 1)
)).toDF("ID", "var1", "var2", "var3", "var4", "var5")

val columnsToSum = List(col("var1"), col("var2"), col("var3"), col("var4"), col("var5"))

val output = input.withColumn("sums", columnsToSum.reduce(_ + _))

output.show()

Then the result is:

+---+----+----+----+----+----+----+
| ID|var1|var2|var3|var4|var5|sums|
+---+----+----+----+----+----+----+
|  a|   5|   7|   9|  12|  13|  46|
|  b|   6|   4|   3|  20|  17|  50|
|  c|   4|   9|   4|   6|   9|  32|
|  d|   1|   2|   6|   8|   1|  18|
+---+----+----+----+----+----+----+

zero323 · Answer 2 · 2016-06-04T11:13:32.537

Plain and simple:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, col}

def sum_(cols: Column*) = cols.foldLeft(lit(0))(_ + _)

val columnstosum = Seq("var1", "var2", "var3", "var4", "var5").map(col _)
df.select(sum_(columnstosum: _*))

with Python equivalent:

from functools import reduce
from operator import add
from pyspark.sql.functions import lit, col

def sum_(*cols):
    return reduce(add, cols, lit(0))

columnstosum = [col(x) for x in ["var1", "var2", "var3", "var4", "var5"]]
select("*", sum_(*columnstosum))

Both will default to NA if there is a missing value in the row. You can use DataFrameNaFunctions.fill or coalesce function to avoid that.

score 4 · Answer 3 · answered Feb 22 '18 at 06:06

I assume you have a dataframe df. Then you can sum up all cols except your ID col. This is helpful when you have many cols and you don't want to manually mention names of all columns like everyone mentioned above. This post has the same answer.

val sumAll = df.columns.collect{ case x if x != "ID" => col(x) }.reduce(_ + _)
df.withColumn("sum", sumAll)

score 2 · Answer 4 · answered Nov 09 '17 at 15:52

2

Here's an elegant solution using python:

NewDF = OldDF.withColumn('sums', sum(OldDF[col] for col in OldDF.columns[1:]))

Hopefully this will influence something similar in Spark ... anyone?.

answered Nov 09 '17 at 15:52

Aerianis

21
1

Adding a column of rowsums across a list of columns in Spark Dataframe

4 Answers4

Linked