How to add a column to a pyspark DataFrame by applying a function on an already existing column?

Question

I would like to apply a binning function to the data in a column of a DataFrame, and store the result in a new column which is added to the DataFrame.

Ideally, I want to be able to use any custom Python function, including recursive ones, because the rows in the column could be arrays and I want to bin each element of every array. Eventually I'd also like to apply operations other than binning.

I know that I can add a new column with withColumn(...), but I do not know how to pass in the function that generates the data for that new column.

EDIT: A similar question solved part of the issue: creating user-defined functions (UDFs). However, a UDF does not seem to accept a list as an argument:

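from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def put_number_in_bin(number, bins):
    # is_number is a helper defined elsewhere (not shown here)
    if is_number(number):
        number = float(number)
        for i, b in enumerate(bins):
            if number <= b:
                return str(i)
        # number is above the last bin edge; "NULL" is an assumed fallback
        return str("NULL")
    else:
        return str("NULL")

# lambda x, bins: ... replaces the Python 2-only form lambda (x, bins): ...
binning_udf = udf(lambda x, bins: put_number_in_bin(x, bins), StringType())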

bins = [0.0, 182.0, 309.4000000000001, 540.0, 846.0, 2714.0, 5872.561999999998, 10655.993999999999, 20183.062, 46350.379999999976, 4852207.7]

df_augment = df_all.withColumn("newCol1", binning_udf(df_all.total_cost, bins))

The result is this error:

TypeError: Invalid argument, not a string or column: [0.0, 182.0, 309.4000000000001, 540.0, 846.0, 2714.0, 5872.561999999998, 10655.993999999999, 20183.062, 46350.379999999976, 4852207.7] of type <type 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
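
The error message hints at one possible workaround: turning the Python list into a column literal with `lit` and `array`, so the UDF receives it as an ordinary column argument. Below is a minimal sketch of that idea, assuming PySpark is installed and that the binned column is numeric. The names `bin_against_edges` and `with_literal_bins` are hypothetical, and mapping nulls and values above the last edge to "NULL" is an assumption of this sketch, not behavior confirmed by the question.

```python
def bin_against_edges(number, bins):
    """Return, as a string, the index of the first edge with number <= edge."""
    if number is None:
        return "NULL"
    for i, edge in enumerate(bins):
        if float(number) <= edge:
            return str(i)
    return "NULL"  # assumed fallback for values above the last edge


def with_literal_bins(df, column, bins, new_column):
    """Add `new_column` by passing `bins` to the UDF as an array column."""
    # Imported lazily so bin_against_edges can be used without Spark installed.
    from pyspark.sql.functions import array, lit, udf
    from pyspark.sql.types import StringType

    binning_udf = udf(bin_against_edges, StringType())
    # A Python list is not a column, but array(lit(...), ...) is.
    edges = array(*[lit(float(b)) for b in bins])
    return df.withColumn(new_column, binning_udf(df[column], edges))


# Usage with the question's names (requires a running Spark session):
# df_augment = with_literal_bins(df_all, "total_cost", bins, "newCol1")
```

Inside the UDF the array column arrives as a plain Python list, so the binning logic itself does not need to change.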

Possible duplicate of [How can I use a function in dataframe withColumn function in Pyspark?](https://stackoverflow.com/questions/44259528/how-can-i-use-a-function-in-dataframe-withcolumn-function-in-pyspark) — Jesse Amano, Apr 02 '19 at 22:08
@JesseAmano Unfortunately that does not solve the whole question as I don't know how to make a udf that handles lists. Edited to add more detail. — AAC, Apr 03 '19 at 04:30
A UDF can only take column arguments, not lists. However, in your case the list to be used does not appear to need to change (and actually if it did it would probably already be a column). To be more explicit: bins should not be an argument in the lambda you create the UDF with. Instead you want to close over that value. — Jesse Amano, Apr 03 '19 at 06:18
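
The closure approach from the comment above could look roughly like the following. This is a sketch under assumptions: PySpark is installed, `df_all`, `total_cost`, and `bins` are as in the question, and `bin_index` and `make_binning_udf` are hypothetical names. It also assumes that non-numeric input, NaN, and values above the last edge should all map to "NULL".

```python
import math
from bisect import bisect_left


def bin_index(value, bins):
    """Return, as a string, the index of the first edge with value <= edge.

    `bins` must be sorted ascending. Non-numeric input, NaN, and values
    above the last edge all map to "NULL" (an assumption of this sketch).
    """
    try:
        number = float(value)
    except (TypeError, ValueError):
        return "NULL"
    if math.isnan(number):
        return "NULL"
    i = bisect_left(bins, number)  # leftmost index with bins[i] >= number
    return str(i) if i < len(bins) else "NULL"


def make_binning_udf(bins):
    """Build a one-argument UDF that closes over `bins`."""
    # Imported lazily so bin_index can be used without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    edges = sorted(float(b) for b in bins)
    # The lambda takes only the column value; `edges` is captured from the
    # enclosing scope rather than passed as a UDF argument.
    return udf(lambda x: bin_index(x, edges), StringType())


# Usage with the question's names (requires a running Spark session):
# binning_udf = make_binning_udf(bins)
# df_augment = df_all.withColumn("newCol1", binning_udf(df_all.total_cost))
```

Because the list never crosses the UDF call boundary, Spark only sees a single column argument and the TypeError does not arise.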
I would like to bin several of these columns which have different bins, would that mean that I need to create several different UDFs? — AAC, Apr 03 '19 at 06:47
I guess it would depend on how your bins are derived and how they relate to your columns. You might try searching the Web for information on how to use the NTILE window function and see if that fits your use case. Otherwise, you will probably need to create a UDF for each column to be transformed. This might not be so terrible if you can pass the column data through another function to build and return each UDF. — Jesse Amano, Apr 03 '19 at 07:23
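
One way to read the last suggestion is a small factory that builds one UDF per column from a mapping of column names to bin edges. The sketch below assumes PySpark is installed and that every listed column is numeric (a non-numeric string would raise inside the UDF). `make_binner`, `add_binned_columns`, the `_bin` suffix, and the column name `other_cost` in the usage comment are all hypothetical.

```python
from bisect import bisect_left


def make_binner(bins):
    """Return a plain Python function that bins one value against `bins`."""
    edges = sorted(float(b) for b in bins)

    def binner(value):
        if value is None:
            return "NULL"
        i = bisect_left(edges, float(value))  # leftmost edge >= value
        return str(i) if i < len(edges) else "NULL"

    return binner


def add_binned_columns(df, bins_by_column, suffix="_bin"):
    """Add one binned column per entry of `bins_by_column` (name -> edges)."""
    # Imported lazily so make_binner can be used without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    for name, bins in bins_by_column.items():
        column_udf = udf(make_binner(bins), StringType())
        df = df.withColumn(name + suffix, column_udf(df[name]))
    return df


# Usage (requires a running Spark session):
# df_augment = add_binned_columns(
#     df_all,
#     {"total_cost": [0.0, 182.0, 540.0], "other_cost": [0.0, 50.0, 100.0]},
# )
```

Each UDF still closes over its own edges, so the per-column setup reduces to one dictionary entry.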
