I have a PySpark DataFrame and I have tried many examples showing how to create a new column based on operations with existing columns, but none of them seem to work.
So I have ~~two~~ one question:
1- Why doesn't this code work?
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
import pyspark.sql.functions as F

sc = SparkContext()
sqlContext = SQLContext(sc)

# one row, three integer columns
a = sqlContext.createDataFrame([(5, 5, 3)], ['A', 'B', 'C'])

# try to add a column holding the row-wise sum of all columns
a.withColumn('my_sum', F.sum(a[col] for col in a.columns)).show()
I get the error:
TypeError: Column is not iterable
EDIT: Answer 1

I found out how to make this work: I have to use the native Python sum function instead of F.sum:

a.withColumn('my_sum', sum(a[col] for col in a.columns)).show()

It works, but I have no idea why.
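If it helps, my rough guess is that the built-in sum simply folds the + operator over the Column objects (starting from 0), so the line above should behave like writing the addition out by hand:

# equivalent explicit form, if my guess about the built-in sum is right:
# 0 + a['A'] + a['B'] + a['C'], using the Column + operator
a.withColumn('my_sum', a['A'] + a['B'] + a['C']).show()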
2- If there is a way to make this sum work, how can I write a udf
function that applies a row-wise operation like the one below (and adds the result to a new column of the DataFrame)?
import numpy as np

def my_dif(row):
    d = np.diff(row)  # creates an array of element-by-element differences
    return d.mean()   # returns the mean of that array
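For what it's worth, this is the direction I have been trying, but I'm not sure the types or the way I pass the columns in are right (the DoubleType return type and the *row unpacking are just my guesses):

from pyspark.sql.types import DoubleType

# sketch of my attempt: wrap my_dif so it receives the column values as
# separate arguments and returns a plain Python float, then register it
my_dif_udf = F.udf(lambda *row: float(my_dif(row)), DoubleType())

a.withColumn('my_dif', my_dif_udf(*[a[c] for c in a.columns])).show()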
I am using Python 3.6.1 and Spark 2.1.1.
Thank you!