
If you add any kind of logging to a UDF in PySpark, it doesn't appear anywhere. Is there some method to make this work?

So far I have tried standard Python logging, py4j, and also print.

We're running PySpark 2.3.2 with the YARN cluster manager on AWS EMR clusters.

For example, here's a function I want to use:

import logging

logger = logging.getLogger(__name__)

def parse_data(attr):
    try:
        # execute something
        ...
    except Exception as e:
        logger.error(e)
        return None

I convert it to a UDF:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

parse_data_udf = F.udf(parse_data, StringType())

And I use it on a dataframe:

from pyspark.sql import types as pst

dataframe = dataframe.withColumn("new_column", parse_data_udf("column").cast(pst.StringType()))

The logs from the function do NOT appear anywhere.

Géza Hodgyai
1 Answer


When using YARN, you can use the following YARN CLI command to check the container logs.

This is where the stdout/stderr (and thus whatever you log inside the UDF) is probably located.

yarn logs -applicationId <Application ID> -containerId <Container ID>
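
For reference, here is a minimal sketch (not from the original question; the parsing logic is a placeholder) of how you might configure logging inside the function itself. The UDF runs in Python worker processes on the executors, not on the driver, so a logger configured there writes to the executors' stderr, which is what the yarn logs command above shows.

import logging
import sys

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def parse_data(attr):
    # Configure the root logger once per Python worker process; basicConfig
    # is a no-op if a handler is already attached, so repeated calls are cheap.
    logging.basicConfig(stream=sys.stderr, level=logging.INFO)
    logger = logging.getLogger("parse_data_udf")
    try:
        return attr.strip()  # placeholder for the real parsing logic
    except Exception as e:
        logger.error("parse_data failed for %r: %s", attr, e)
        return None

parse_data_udf = F.udf(parse_data, StringType())

Configuring the logger on the driver has no effect inside the UDF, because the executor workers are separate processes; that is why logging has to be set up (or a print sent to stderr) inside the function body itself.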
Thijs