I have a PySpark DataFrame and I have tried many examples showing how to create a new column based on operations with existing columns, but none of them seem to work.
So I have ~~two~~ one question:
1- Why doesn't this code work?
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
import pyspark.sql.functions as F

sc = SparkContext()
sqlContext = SQLContext(sc)

# one row, three integer columns
a = sqlContext.createDataFrame([(5, 5, 3)], ['A', 'B', 'C'])

# try to add a column holding the row-wise sum of all columns
a.withColumn('my_sum', F.sum(a[col] for col in a.columns)).show()
I get the error:
TypeError: Column is not iterable
EDIT: Answer 1

I found out how to make this work: I have to use the native Python sum function instead of F.sum:

a.withColumn('my_sum', sum(a[col] for col in a.columns)).show()

It works, but I have no idea why.
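If it helps, my rough guess is that the built-in sum simply folds the + operator over the Column objects (starting from 0), so the line above should behave like writing the addition out by hand:

# equivalent explicit form, if my guess about the built-in sum is right:
# 0 + a['A'] + a['B'] + a['C'], using the Column + operator
a.withColumn('my_sum', a['A'] + a['B'] + a['C']).show()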
2- If there is a way to make this sum work, how can I write a udf
function that applies a row-wise operation like the one below (and adds the result to a new column of the DataFrame)?
import numpy as np

def my_dif(row):
    d = np.diff(row)  # creates an array of element-by-element differences
    return d.mean()   # returns the mean of that array
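For what it's worth, this is the direction I have been trying, but I'm not sure the types or the way I pass the columns in are right (the DoubleType return type and the *row unpacking are just my guesses):

from pyspark.sql.types import DoubleType

# sketch of my attempt: wrap my_dif so it receives the column values as
# separate arguments and returns a plain Python float, then register it
my_dif_udf = F.udf(lambda *row: float(my_dif(row)), DoubleType())

a.withColumn('my_dif', my_dif_udf(*[a[c] for c in a.columns])).show()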
I am using Python 3.6.1 and Spark 2.1.1.
Thank you!