
I am using Spark 2.1 with PySpark.

Problem Statement: I have a scenario where I need to pass multiple columns as input and return one column as output. Below is my input DataFrame of 3 columns:

a   b   c
S   S   S
S   NS  NS
S   NS  S
S   S   NS
NS  S   NS

My output has to be as below:

a   b   c   d
S   S   S   S
S   NS  NS  NS
S   NS  S   S
S   S   NS  NS
NS  S   NS  NS
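Before reaching for Spark at all, the implied (a, b, c) -> d mapping can be sketched in plain Python to sanity-check the rules. This is a hedged sketch: `derive_d` and `SAMPLE_RULES` are hypothetical names, and the dictionary only encodes the five sample rows above, not the full business logic.

```python
# Hedged sketch: the (a, b, c) -> d mapping implied by the sample rows.
# `derive_d` and `SAMPLE_RULES` are illustrative names, not part of the question.
SAMPLE_RULES = {
    ('S', 'S', 'S'): 'S',
    ('S', 'NS', 'NS'): 'NS',
    ('S', 'NS', 'S'): 'S',
    ('S', 'S', 'NS'): 'NS',
    ('NS', 'S', 'NS'): 'NS',
}

def derive_d(a, b, c):
    # In every sample row, d happens to equal c; the real rules may differ.
    return SAMPLE_RULES.get((a, b, c))
```

Note that in all five sample rows `derive_d(a, b, c)` equals `c`, which is why one commenter asks whether columns a and b matter at all.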

I am trying to register a UDF that takes these 3 columns [a, b, c] as input and returns column d as output; here a, b, c, d are the column names.

I am finding it difficult to get the output. Below is the syntax I used:

def return_string(x):
      if [x.a=='s' & x.b=='S' & x.c=='s']
          return 'S'
      else if[x.a=='s' & x.b=='NS' & x.c=='s']
          return 'S'
      else if[x.a=='s' & x.b=='S' & x.c=='NS']
          return 'NS'

func = udf(return_string, types.StringType())

Can anyone please help me in completing this logic.

Alper t. Turker
user3292373
    Are all three columns important? For this sample output it seems to only depend on C. – Steven Laan Aug 23 '17 at 19:28
  • Possible duplicate of [Pyspark: Pass multiple columns in UDF](https://stackoverflow.com/questions/42540169/pyspark-pass-multiple-columns-in-udf) – Steven Laan Aug 23 '17 at 19:30
  • Yes, all three are important; I missed one more rule: if x.a=='NS' & x.b=='S' | x.c=='NS', return 'NS'. What you have mentioned is right for this sample output, where column c alone could be considered. – user3292373 Aug 23 '17 at 19:33

2 Answers


I would do it using the built-in withColumn and when functions:

from pyspark.sql.functions import col, when, lit

df.withColumn('d', when(
     ((col('a') == 'S') & (col('b') == 'S') & (col('c') == 'S'))
   | ((col('a') == 'S') & (col('b') == 'NS') & (col('c') == 'S'))
 , lit('S')
 ).otherwise(lit('NS'))
).show()

This also assumes that the two cases are mutually exclusive (hence the otherwise).

Steven Laan

It should be:

from pyspark.sql.functions import udf

@udf
def return_string(a, b, c):
    if a == 's' and b == 'S' and c == 's':
        return 'S'
    if a == 's' and b == 'NS' and c == 's':
        return 'S'
    if a == 's' and b == 'S' and c == 'NS':
        return 'NS'

df = sc.parallelize([('s', 'S', 'NS'), ('?', '?', '?')]).toDF(['a', 'b', 'c'])

df.withColumn('result', return_string('a', 'b', 'c')).show()
## +---+---+---+------+
## |  a|  b|  c|result|
## +---+---+---+------+
## |  s|  S| NS|    NS|
## |  ?|  ?|  ?|  null|
## +---+---+---+------+
  • All arguments should be listed (unless you pass data as a struct).
  • You should use and, not & (you are evaluating logical expressions, not SQL expressions).
  • Conditions should be expressions, not lists (non-empty lists are always truthy).

Personally I'd skip all the ifs and use a simple dict:

@udf
def return_string(a, b, c):
    mapping = {
        ('s', 'S', 's'): 'S',
        ('s', 'NS', 's'): 'S',
        ('s', 'S', 'NS'): 'NS',
    }
    return mapping.get((a, b, c))

Adjust conditions according to your requirements.
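A nice side effect of the dict approach is that the lookup can be exercised as a plain Python function before wrapping it in udf. A minimal sketch, using the same three keys as above:

```python
def return_string(a, b, c):
    # Same lookup as the UDF body above, usable without a SparkSession.
    mapping = {
        ('s', 'S', 's'): 'S',
        ('s', 'NS', 's'): 'S',
        ('s', 'S', 'NS'): 'NS',
    }
    return mapping.get((a, b, c))

print(return_string('s', 'S', 'NS'))  # -> NS
print(return_string('?', '?', '?'))   # -> None (shows up as null in Spark)
```

Unmatched tuples fall through to `None`, which Spark renders as null, matching the behaviour of the UDF example earlier in this answer.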

Overall you should prefer SQL expressions as shown in the excellent answer provided by Steven Laan (you can chain multiple conditions with when(..., ...).when(..., ...)).

Alper t. Turker