
I am able to follow the instructions in https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-enable.html and log messages from the driver. But when I try to use the logger inside the map function like this:

sc = SparkContext()
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
logger.info("starting glue job...") #successful
...
def transform(item):
    logger.info("starting transform...") #error
    ...transform logics...

Map.apply(frame = dynamicFrame, f = transform)

I get this error:

PicklingError: Could not serialize object: TypeError: can't pickle _thread.RLock objects

From what I have researched, the message implies that the logger object cannot be serialized when it is shipped to the workers.
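The failure can be reproduced without Glue at all: the logger returned by `get_logger()` wraps objects that hold a thread lock, and a bare `threading.RLock` is unpicklable in exactly the same way. This is a minimal sketch of the underlying cause, not Glue-specific code:

```python
import pickle
import threading

# The Glue logger holds a lock (and a py4j JVM handle) internally.
# A plain RLock triggers the same class of error when Spark tries
# to pickle a closure that captures it.
lock = threading.RLock()

try:
    pickle.dumps(lock)
except TypeError as exc:
    print(f"pickling failed: {exc}")
```

Because Spark pickles the whole closure of a mapper, any captured variable that contains such a lock makes the entire function unserializable.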

What's the correct way to do logging in AWS Glue worker?

  • https://stackoverflow.com/a/52064491/4326922 – Prabhakar Reddy Sep 14 '21 at 01:55
  • @PrabhakarReddy Thank you for the link! But that's not what I wanted. That post talks about how to aggregate logs and send them back to the driver, whereas I want to log on the executor itself. Glue has an executor log stream that I can access, but I can't find a way to write logs to it – Xiqiang Lin Sep 21 '21 at 00:19
  • @XiqiangLin, did you ever figure this out? – TheHud Jan 19 '22 at 16:38

2 Answers


Your logger object can't be sent to the remote executors, which is why you get the serialization error. You have to initialize the logger inside the mapper function instead.

But doing that in a transform can be expensive resource-wise: mappers should ideally be quick and lightweight, since they are executed on every row.

Here is how you can do it, at least on Glue 3.0. The log events will end up in the error logs.

import logging

def transform(record):
    logging.basicConfig(level=logging.INFO, format="MAPPER %(asctime)s [%(levelname)s] [%(name)s] %(message)s")
    map_logger = logging.getLogger()
    map_logger.info("an info event")
    map_logger.error("an error event")

    return record
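To avoid reconfiguring logging on every row, one option is to cache the logger at module level so `basicConfig` runs only once per executor Python process. This is a sketch under the assumption that the mapper's module is shipped to the executors; the `get_map_logger` helper and `"transformer"` logger name are my own illustration, not a Glue API:

```python
import logging

_map_logger = None  # cached once per executor Python process


def get_map_logger():
    """Lazily configure logging on first use, then reuse the logger."""
    global _map_logger
    if _map_logger is None:
        logging.basicConfig(
            level=logging.INFO,
            format="MAPPER %(asctime)s [%(levelname)s] [%(name)s] %(message)s",
        )
        _map_logger = logging.getLogger("transformer")
    return _map_logger


def transform(record):
    get_map_logger().info("starting transform...")
    return record
```

The first row processed by each executor pays the configuration cost; every subsequent row reuses the cached logger.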

Here's a full example script:

import logging

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import Map
from pyspark.context import SparkContext
from pyspark.sql.types import IntegerType

# Configure logging for the driver
logging.basicConfig(level=logging.INFO, format='%(asctime)s [%(levelname)s] [%(name)s] %(message)s')
logger = logging.getLogger(__name__)


def main():
    logger.info("======== Job Initialisation ==========")

    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark_session = glue_context.spark_session

    logger.info("======== Start of ETL ==========")

    df = spark_session.createDataFrame(range(1, 100), IntegerType())

    dynf = DynamicFrame.fromDF(df, glue_context, "dynf")

    # Apply mapper function on each row
    dynf = Map.apply(frame=dynf, f=transform)

    logger.info(f"Result: {dynf.show(10)}")

    logger.info("Done")


def transform(record):
    logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] [%(name)s] %(message)s")
    map_logger = logging.getLogger("transformer")
    map_logger.info("an info event")
    map_logger.error("an error event")

    return record

main()
selle

I faced the same issue in Glue and even tried to contact AWS, but had no luck until @selle's answer helped me. I also found that even without a logger, you can see `print` output from the executors in the error logs; you just have to dig into the error log streams.

This will give you a clear picture of Glue logging: https://docs.aws.amazon.com/glue/latest/dg/reduced-start-times-spark-etl-jobs.html#reduced-start-times-logging

By the way, this is my first post on Stack Overflow.

Vignxsh s