
I am using Spark 2.1 with PySpark.

Problem Statement: I have a scenario where I need to pass multiple columns as input and return one column as output. Below is my input DataFrame of 3 columns:

a   b   c
S   S   S
S   NS  NS
S   NS  S
S   S   NS
NS  S   NS

My output has to be as below:

a   b   c   d
S   S   S   S
S   NS  NS  NS
S   NS  S   S
S   S   NS  NS
NS  S   NS  NS
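Before reaching for Spark at all, the implied (a, b, c) -> d mapping can be sketched in plain Python to sanity-check the rules. This is a hedged sketch: `derive_d` and `SAMPLE_RULES` are hypothetical names, and the dictionary only encodes the five sample rows above, not the full business logic.

```python
# Hedged sketch: the (a, b, c) -> d mapping implied by the sample rows.
# `derive_d` and `SAMPLE_RULES` are illustrative names, not part of the question.
SAMPLE_RULES = {
    ('S', 'S', 'S'): 'S',
    ('S', 'NS', 'NS'): 'NS',
    ('S', 'NS', 'S'): 'S',
    ('S', 'S', 'NS'): 'NS',
    ('NS', 'S', 'NS'): 'NS',
}

def derive_d(a, b, c):
    # In every sample row, d happens to equal c; the real rules may differ.
    return SAMPLE_RULES.get((a, b, c))
```

Note that in all five sample rows `derive_d(a, b, c)` equals `c`, which is why one commenter asks whether columns a and b matter at all.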

I am trying to register a UDF that takes these 3 columns [a, b, c] as input and returns column d as output; here a, b, c, d are the column names.

I am finding it difficult to get the output. Below is the syntax I used:

def return_string(x):
      if [x.a=='s' & x.b=='S' & x.c=='s']
          return 'S'
      else if[x.a=='s' & x.b=='NS' & x.c=='s']
          return 'S'
      else if[x.a=='s' & x.b=='S' & x.c=='NS']
          return 'NS'

func = udf(return_string, types.StringType())

Can anyone please help me in completing this logic.

Alper t. Turker
user3292373
    Are all three columns important? For this sample output it seems to only depend on C. – Steven Laan Aug 23 '17 at 19:28
  • Possible duplicate of [Pyspark: Pass multiple columns in UDF](https://stackoverflow.com/questions/42540169/pyspark-pass-multiple-columns-in-udf) – Steven Laan Aug 23 '17 at 19:30
  • Yes, all three are important; I missed one more rule: if x.a=='NS' & x.b=='S' | x.c=='NS', return 'NS'. What you have mentioned is right for this sample output, where column c alone could be considered. – user3292373 Aug 23 '17 at 19:33

2 Answers


I would do it using the built-in withColumn and when functions:

from pyspark.sql.functions import col, when, lit

df.withColumn('d', when(
     ((col('a') == 'S') & (col('b') == 'S') & (col('c') == 'S'))
   | ((col('a') == 'S') & (col('b') == 'NS') & (col('c') == 'S'))
 , lit('S')
 ).otherwise(lit('NS'))
).show()

This also assumes that the two cases are mutually exclusive (hence the otherwise).

Steven Laan

It should be:

from pyspark.sql.functions import udf

@udf
def return_string(a, b, c):
    if a == 's' and b == 'S' and c == 's':
        return 'S'
    if a == 's' and b == 'NS' and c == 's':
        return 'S'
    if a == 's' and b == 'S' and c == 'NS':
        return 'NS'

df = sc.parallelize([('s', 'S', 'NS'), ('?', '?', '?')]).toDF(['a', 'b', 'c'])

df.withColumn('result', return_string('a', 'b', 'c')).show()
## +---+---+---+------+
## |  a|  b|  c|result|
## +---+---+---+------+
## |  s|  S| NS|    NS|
## |  ?|  ?|  ?|  null|
## +---+---+---+------+
  • All arguments should be listed (unless you pass data as a struct).
  • You should use and, not & (you are evaluating logical expressions, not SQL expressions).
  • Conditions should be expressions, not lists (non-empty lists are always truthy).

Personally I'd skip all the ifs and use a simple dict:

@udf
def return_string(a, b, c):
    mapping = {
        ('s', 'S', 's'): 'S',
        ('s', 'NS', 's'): 'S',
        ('s', 'S', 'NS'): 'NS',
    }
    return mapping.get((a, b, c))

Adjust conditions according to your requirements.
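A nice side effect of the dict approach is that the lookup can be exercised as a plain Python function before wrapping it in udf. A minimal sketch, using the same three keys as above:

```python
def return_string(a, b, c):
    # Same lookup as the UDF body above, usable without a SparkSession.
    mapping = {
        ('s', 'S', 's'): 'S',
        ('s', 'NS', 's'): 'S',
        ('s', 'S', 'NS'): 'NS',
    }
    return mapping.get((a, b, c))

print(return_string('s', 'S', 'NS'))  # -> NS
print(return_string('?', '?', '?'))   # -> None (shows up as null in Spark)
```

Unmatched tuples fall through to `None`, which Spark renders as null, matching the behaviour of the UDF example earlier in this answer.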

Overall you should prefer SQL expressions as shown in the excellent answer provided by Steven Laan (you can chain multiple conditions with when(..., ...).when(..., ...)).

Alper t. Turker