I have a Spark DataFrame with several columns. I want to add a column to the DataFrame that is the sum of a chosen subset of the other columns.
For example, my data looks like this:
ID var1 var2 var3 var4 var5
a 5 7 9 12 13
b 6 4 3 20 17
c 4 9 4 6 9
d 1 2 6 8 10
I want to add a column that sums specific columns for each row:
ID var1 var2 var3 var4 var5 sums
a 5 7 9 12 13 46
b 6 4 3 20 17 50
c 4 9 4 6 9 32
d 1 2 6 8 10 27
I know it is possible to add columns together if you know the specific columns to add:
val newdf = df.withColumn("sumofcolumns", df("var1") + df("var2"))
But is it possible to pass a list of column names and add them together? This answer is basically what I want, but it uses the Python API instead of Scala (Add column sum as new column in PySpark dataframe). Based on it, I think something like this would work:
// Select columns to sum
val columnstosum = List("var1", "var2", "var3", "var4", "var5")
// Create new column called sumofcolumns which is sum of all columns listed in columnstosum
val newdf = df.withColumn("sumofcolumns", df.select(columnstosum.head, columnstosum.tail: _*).sum)
This throws the error: value sum is not a member of org.apache.spark.sql.DataFrame. Is there a way to sum across columns?
Thanks in advance for your help.
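For reference, here is a minimal sketch of one approach that may work: turn each column name into a Column with `col`, then fold the columns together with `+`. It assumes Spark 2.x or later (for `SparkSession`), a local master, and numeric columns. The object name `SumColumns` and the app name are made up for illustration, and the data is the sample table from above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical wrapper object, only here to make the sketch self-contained.
object SumColumns {
  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark 2.x+ session; adjust for your cluster.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sum-columns-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question.
    val df = Seq(
      ("a", 5, 7, 9, 12, 13),
      ("b", 6, 4, 3, 20, 17),
      ("c", 4, 9, 4, 6, 9),
      ("d", 1, 2, 6, 8, 10)
    ).toDF("ID", "var1", "var2", "var3", "var4", "var5")

    // Names of the columns to sum.
    val columnstosum = List("var1", "var2", "var3", "var4", "var5")

    // Map each name to a Column, then combine them pairwise with +.
    // The result is a single Column expression: var1 + var2 + ... + var5.
    val newdf = df.withColumn("sumofcolumns", columnstosum.map(col).reduce(_ + _))

    newdf.show()
    spark.stop()
  }
}
```

Note that `+` on columns returns null for a row if any of the summed values is null, so columns containing nulls would probably need to be wrapped in something like `coalesce(col(name), lit(0))` first.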