
I am trying to conditionally apply a udf, some_function(), to column b1, based on the value in a1 (otherwise leave b1 unchanged), using pyspark.sql.functions.when(condition, value) and a simple udf:

from pyspark.sql.functions import udf, when

some_function = udf(lambda x: x.translate(...))
df = df.withColumn('c1', when(df.a1 == 1, some_function(df.b1)).otherwise(df.b1))

With this example data:

|    a1|    b1|
---------------
|     1|'text'|
|     2|  null|

I am seeing that some_function() is always evaluated, regardless of the condition (i.e. the udf calls translate() on null and crashes), and its result applied only if the condition is true. To clarify, this is not about udfs handling null correctly, but about when(...) always executing value if value is a udf.

Is this behaviour intended? If so, how can I apply a method conditionally so it doesn't get executed when the condition is not met?
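A minimal null-safe sketch of the udf body (the real translate(...) arguments are elided above, so vowel-stripping is used here purely as a placeholder assumption):

```python
# Null-safe wrapper: Spark may batch-evaluate a Python udf even on rows
# where the when() condition is false, so guard against None explicitly.
# The vowel-stripping mapping is a placeholder; the actual translate(...)
# arguments are elided in the question.
def safe_translate(x):
    if x is None:
        return None
    return x.translate(str.maketrans('', '', 'aeiou'))

# Wiring it into Spark would be unchanged otherwise, e.g.:
# some_function = udf(safe_translate)
# df = df.withColumn('c1', when(df.a1 == 1, some_function(df.b1)).otherwise(df.b1))
```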

fluxens
  • Why do you say that _"It seems like some_function() is always evaluated"_? – pault Mar 13 '18 at 20:18
  • You are right, it is always executed, regardless of the condition given in when(). changed the question to make this more clear. – fluxens Mar 14 '18 at 08:25
  • No I mean how do you know it's always executed? – pault Mar 14 '18 at 12:49
  • In this example some_function() will throw an exception if x == null (cannot call translate() on null). In my dataset there is no a1 == 1 with b1 == null, BUT there is a1 != 1 with b1 == null. I've updated the question. – fluxens Mar 14 '18 at 13:58
  • Ok I was able to reproduce your issue, but I do not believe it's because translate is being called on null. Try this: `some_function = udf(lambda x: str(x).translate(...))`. See more [here](https://stackoverflow.com/questions/10367302/what-is-producing-typeerror-character-mapping-must-return-integer-in-this-p) and [here](https://stackoverflow.com/questions/1324067/how-do-i-get-str-translate-to-work-with-unicode-strings). – pault Mar 15 '18 at 17:44
  • Today I found out that there's a function [`pyspark.sql.functions.translate()`](http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.translate) that should do what you want without using a `udf`. – pault Mar 27 '18 at 15:02
  • So I was wrong. I ran into the [same issue](https://stackoverflow.com/questions/49634651/using-udf-ignores-condition-in-when) myself and it turns out that this is a result of the optimizer doing batch processing for python `udf`s. My solution of converting to `str` probably worked because it converted the value `null` into the string `"null"`. In any case, your `udf` has to be robust to handle bad inputs, or use the builtin `pyspark.sql.functions.translate()` – pault Apr 04 '18 at 15:26
  • Possible duplicate of [Using UDF ignores condition in when](https://stackoverflow.com/questions/49634651/using-udf-ignores-condition-in-when) – pault Apr 04 '18 at 15:27
  • So the answer is: udfs are always executed when used in when(), regardless of condition. So make them robust enough, and make sure the execution does not cause side effects. – fluxens Apr 09 '18 at 11:33
  • I can't say for sure "always executed" but the rest is accurate. – pault Apr 09 '18 at 11:58
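For what it's worth, the str(x) workaround mentioned in the comments only masks the crash rather than fixing it, because Python renders None as the string "None" (using the same placeholder vowel-stripping mapping as above):

```python
# str() coerces None to the string "None", so translate() no longer
# crashes - but null values are silently rewritten instead of preserved.
coerced = str(None)                       # "None"
table = str.maketrans('', '', 'aeiou')    # placeholder mapping (assumption)
print(coerced.translate(table))           # prints "Nn"
```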

0 Answers