
I need to add the values of columns t1, t2, t3, t4, and t5 row-wise and store the result in a new column called "totaltime" in PySpark. The dataframe is of the following format:

 +--------+--------+------+------+------+------+
 |    Ser |    t1  |  t2  |  t3  |  t4  |  t5  |
 +--------+--------+------+------+------+------+
 |07142017|      84|   187|   214|   119|     7|
 |20170714|      84|   187|   209|   115|     8|
 |20170715|      83|   188|   208|   119|     6|
 |20170716|      84|   188|   206|   106|     5|
 |20170714|      86|   188|   209|   119|     4|
 +--------+--------+------+------+------+------+

I wrote the following code:

sum1 = df1.select("t1","t2","t3","t4","t5").sum()
df1 = df1.withColumn("totaltime",sum1)

I get the following error:

AttributeError: 'DataFrame' object has no attribute 'sum'

How do I do this in PySpark?

1 Answer


A PySpark DataFrame has no `.sum()` method, which is why you get the `AttributeError`. Instead, build a single column expression by adding the columns together:

 df1 = df1.withColumn('totaltime', sum(df1[col] for col in ["t1","t2","t3","t4","t5"]))

This works because pyspark's `Column` overloads the `+` operator, so Python's built-in `sum` chains the columns into the expression `t1 + t2 + t3 + t4 + t5`, evaluated row by row.
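For intuition, here is a minimal pure-Python sketch of the mechanism (no Spark required; the `Col` class below is a toy stand-in for pyspark's `Column`, not the real API). Chaining `+` over a list of column objects builds one combined expression:

```python
from functools import reduce
import operator

# Toy stand-in illustrating how pyspark's Column overloads "+":
# each addition returns a new object wrapping a larger expression.
class Col:
    def __init__(self, expr):
        self.expr = expr

    def __add__(self, other):
        return Col(f"({self.expr} + {other.expr})")

cols = [Col(c) for c in ["t1", "t2", "t3", "t4", "t5"]]

# reduce(operator.add, ...) folds the list left-to-right with "+",
# the same thing Python's built-in sum does over real Columns.
total = reduce(operator.add, cols)
print(total.expr)  # -> ((((t1 + t2) + t3) + t4) + t5)
```

One real-data caveat: in Spark, adding a NULL to anything yields NULL, so if any of t1..t5 can be null, wrap each column in `F.coalesce(df1[c], F.lit(0))` before summing.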