I have a dataframe and I'd like to filter some rows out based on one column. But my conditions are quite complex and will require a separate function; it's not something I can do in a single expression or where clause.
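For reference, the dataframe looks roughly like this (made-up values; the real table has many more columns and rows):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # made-up stand-in for my real table; col1 is a nullable string column
    table1_df = spark.createDataFrame(
        [('abc',), ('def',), (None,)],
        ['col1']
    )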
My plan was to have the function return True or False according to whether each row should be kept or filtered out:
    from pyspark.sql.types import BooleanType
    from pyspark.sql.functions import udf

    def my_filter(col1):
        # I will have more conditions here, but for simplicity...
        if col1 is null:
            return True

    my_filter_udf = udf(my_filter, BooleanType())
    df = table1_df \
        .select( \
            'col1' \
        ) \
        .filter(my_filter_udf('col1')) \
        .show()
I get the error "SparkException: Exception thrown in Future.get".
What's weird is that if I strip the function down to just:
    def my_filter(col1):
        return True
it works. But as soon as I reference the col1 value inside the function, I get this error. So Spark is clearly reaching and running the function; it just doesn't seem to like what I'm passing into it.
Any idea where I'm going wrong here?
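In case it helps, the real function will eventually look more like this (the conditions below are just placeholders, not my actual logic):

    def my_filter(col1):
        # placeholder checks standing in for my real conditions
        if col1 == 'keep_me':
            return True
        if col1 and col1.startswith('tmp_'):
            return True
        return False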