If you add any kind of logging to a UDF in PySpark, it doesn't appear anywhere. Is there some way to make it appear?
So far I've tried the standard Python logging module, py4j, and plain print.
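For reference, the py4j attempt looked roughly like this (a sketch rather than the exact code; it assumes an active SparkSession named spark and grabs the JVM-side log4j logger through the py4j gateway, with an arbitrary logger name):

log4j = spark._jvm.org.apache.log4j
py4j_logger = log4j.LogManager.getLogger("parse_data")
py4j_logger.warn("test message")

Note that a logger obtained this way only exists on the driver, since the py4j gateway isn't available inside executor-side UDF code.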
We're running PySpark 2.3.2 with the YARN cluster manager on AWS EMR clusters.
For example, here's a function I want to use:
import logging

logger = logging.getLogger(__name__)

def parse_data(attr):
    try:
        ...  # execute something
    except Exception as e:
        logger.error(e)
        return None
I convert it to a UDF:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

parse_data_udf = F.udf(parse_data, StringType())
And I use it on a DataFrame:
from pyspark.sql import types as pst

dataframe = dataframe.withColumn("new_column", parse_data_udf("column").cast(pst.StringType()))
The logs from the function do NOT appear anywhere.