I have a function in PySpark that takes two inputs and returns two outputs:
def get_seen_cards(x, y):
    if 1 in x:
        alreadyFailed = 1
    else:
        alreadyFailed = 0
    if y:
        alreadyAuthorized = 1
    else:
        alreadyAuthorized = 0
    return alreadyFailed, alreadyAuthorized
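For clarity, called on plain Python values (assuming `x` is a collection and `y` a boolean), the function behaves like this:

```python
# Same function as above, runnable on plain Python values.
def get_seen_cards(x, y):
    already_failed = 1 if 1 in x else 0       # flag if 1 appears in the collection x
    already_authorized = 1 if y else 0        # flag if y is truthy
    return already_failed, already_authorized

print(get_seen_cards([1, 3], False))  # (1, 0)
print(get_seen_cards([2], True))      # (0, 1)
```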
And I want to apply this function with a udf to treat the whole dataframe like this:
get_seen_cards_udf = udf(lambda x, y : get_seen_cards(x, y), IntegerType())
data.withColumn(["alr_failed", "alr_auth"], get_seen_cards_udf(data["card_uid"], data["failed"]))
Where data["card_uid"] looks like this:
[Row(card_uid='card_1'),
Row(card_uid='card_2'),
Row(card_uid='card_3'),
Row(card_uid='card_4'),
Row(card_uid='card_5')]
and data["failed"] looks like this:
[Row(failed=False),
Row(failed=False),
Row(failed=False),
Row(failed=True),
Row(failed=False)]
But this obviously doesn't work, because withColumn can only add one column at a time.
I need to add two columns to my dataframe at the same time: the first holds the first value returned by the function and will be stored in "alr_failed", and the other column holds the second returned value and will be stored in "alr_auth".
The idea is to end up with a dataframe with the following columns after processing:
card_uid, failed, alr_failed, alr_auth
Is this even possible somehow? Or is there a workaround?