I would like to apply a binning function to the data in a column of a DataFrame and store the result in a new column added to that DataFrame.
Ideally, I want to be able to use any custom Python function, including a recursive one, because the rows in the column could be arrays and I want to bin each element of every array. Eventually I'd also like to do other operations besides binning.
I know that I can add a new column with withColumn(...), but I do not know how to pass in the function that generates the data for that new column.
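To make the "recursion over arrays" requirement concrete, here is a minimal pure-Python sketch of what I mean. The names apply_recursively and to_bin and the toy edges are made up for illustration and are not part of any Spark API; sending values above the last edge to "NULL" is my own assumption.

```python
def apply_recursively(value, fn):
    """Apply fn to a scalar, or to every element of a (possibly nested) list."""
    if isinstance(value, (list, tuple)):
        return [apply_recursively(v, fn) for v in value]
    return fn(value)


def to_bin(number, edges=(0.0, 10.0, 100.0)):
    """Toy binning function: index of the first edge >= number, as a string."""
    for i, edge in enumerate(edges):
        if number <= edge:
            return str(i)
    # Assumption: values above the last edge fall into no bin.
    return "NULL"


# A scalar and a nested list are handled by the same call.
print(apply_recursively(5, to_bin))
print(apply_recursively([5, [50, 500]], to_bin))
```

If a function like this were wrapped in a UDF for a flat array column, I assume the declared return type would have to be ArrayType(StringType()) rather than StringType().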
EDIT: This similar question solved part of the issue: creating user-defined functions (UDFs). However, a UDF does not seem to accept a list as an argument:
def put_number_in_bin(number, bins):
    if is_number(number):
        number = float(number)
        for i, b in enumerate(bins):
            if number <= b:
                bin_selected = str(i)
                break
        return bin_selected
    else:
        return str("NULL")

binning_udf = udf(lambda (x, bins): put_number_in_bin(x, bins), StringType())
bins = [0.0, 182.0, 309.4000000000001, 540.0, 846.0, 2714.0, 5872.561999999998, 10655.993999999999, 20183.062, 46350.379999999976, 4852207.7]
df_augment = df_all.withColumn("newCol1", binning_udf(df_all.total_cost, bins))
The result is this error:
TypeError: Invalid argument, not a string or column: [0.0, 182.0, 309.4000000000001, 540.0, 846.0, 2714.0, 5872.561999999998, 10655.993999999999, 20183.062, 46350.379999999976, 4852207.7] of type <type 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
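The error suggests that every argument passed to a UDF call must be a column, so one direction that might work is to not pass the list at call time at all and instead capture it in a closure when the UDF is built. The sketch below assumes Python 3; make_binning_udf is a hypothetical helper name, is_number is a stand-in for the helper not shown above, and returning "NULL" for values above the last edge is an assumption (the original function would hit an unbound bin_selected in that case).

```python
def is_number(value):
    """Stand-in for the is_number helper, which is not shown in the question."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def put_number_in_bin(number, bins):
    """Return the index (as a string) of the first edge >= number, else "NULL"."""
    if not is_number(number):
        return "NULL"
    number = float(number)
    for i, b in enumerate(bins):
        if number <= b:
            return str(i)
    # Assumption: values above the last edge get "NULL" instead of raising.
    return "NULL"


def make_binning_udf(bins):
    """Build a one-argument UDF; bins is captured by the closure, not passed as a column."""
    # Imported here so the pure-Python part above can be tried without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(lambda x: put_number_in_bin(x, bins), StringType())
```

With that, the call would become df_all.withColumn("newCol1", make_binning_udf(bins)(df_all.total_cost)), so only a column is passed to the UDF. The error message also points at another possible route, turning the list into a column literal with array(*[lit(b) for b in bins]) and keeping a two-argument UDF, but I have not verified that here.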