
I am trying to use an if condition inside a Python function and then use it to do some calculations with DataFrame values.

#init data
+---+----+----+------+
| id|team|game|result|
+---+----+----+------+
|  1|   A|Home|      |
|  2|   A|Away|      |
|  3|   B|Home|      |
|  4|   B|Away|      |
|  5|   C|Home|      |
|  6|   C|Away|      |
|  7|   D|Home|      |
|  8|   D|Away|      |
+---+----+----+------+

### I want to replace the value in the result column, and I tried using a function

from pyspark.sql.functions import col

def replace_result(team_name, game_kind, result):
    if col('team') == team_name and col('game') == game_kind:
        return result
    else:
        return col('result')

df = df.withColumn('result', replace_result('A', 'Away', '0-1'))

but it gave me the error:

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

My question is

Is it possible to use if conditions with PySpark DataFrame columns?

Thanks


2 Answers


Yes, there are built-in Spark SQL functions called when and otherwise that do exactly that.

With the following DataFrame:

df.show()
+---+----+----+
| id|team|game|
+---+----+----+
|  1|   A|Home|
|  2|   A|Away|
|  3|   B|Home|
|  4|   B|Away|
|  5|   C|Home|
|  6|   C|Away|
|  7|   D|Home|
|  8|   D|Away|
+---+----+----+

You can use when and otherwise in the following way:

from pyspark.sql import functions

# Flag each matching (team, game) combination as WIN; everything else is LOSS.
df = (df.withColumn("result",
        functions.when((df["team"] == "A") & (df["game"] == "Home"), "WIN")
                 .when((df["team"] == "B") & (df["game"] == "Away"), "WIN")
                 .when((df["team"] == "D") & (df["game"] == "Home"), "WIN")
                 .when((df["team"] == "D") & (df["game"] == "Away"), "WIN")
                 .otherwise("LOSS")))

df.show()
+---+----+----+------+
| id|team|game|result|
+---+----+----+------+
|  1|   A|Home|   WIN|
|  2|   A|Away|  LOSS|
|  3|   B|Home|  LOSS|
|  4|   B|Away|   WIN|
|  5|   C|Home|  LOSS|
|  6|   C|Away|  LOSS|
|  7|   D|Home|   WIN|
|  8|   D|Away|   WIN|
+---+----+----+------+
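Applied to the original replace_result('A', 'Away', '0-1') case from the question, a minimal sketch (assuming the existing result column should be kept for all non-matching rows) would be:

from pyspark.sql import functions

# Overwrite result only for team A away games; keep the current value otherwise.
df = df.withColumn(
    "result",
    functions.when((df["team"] == "A") & (df["game"] == "Away"), "0-1")
             .otherwise(df["result"]))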


You need to use a UDF if you want to run custom Python code on DataFrame columns.

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.udf.html
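For reference, a minimal sketch of the UDF approach, hardcoding the 'A'/'Away'/'0-1' case from the question (the function name and logic are only illustrative), could look like this:

from pyspark.sql import functions
from pyspark.sql.types import StringType

# Inside a UDF the arguments are plain row values, not Columns,
# so an ordinary Python if/else works as the question intended.
def replace_result(team, game, result):
    if team == "A" and game == "Away":
        return "0-1"
    return result

replace_result_udf = functions.udf(replace_result, StringType())

df = df.withColumn(
    "result",
    replace_result_udf(df["team"], df["game"], df["result"]))

Note that a UDF runs row by row in Python, so the when/otherwise approach above is usually faster when the logic can be expressed with built-in column functions.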
