Your logger object can't be serialized and sent to the remote executors, which is why you get the serialization error.
You have to initialize the logger inside the mapper function instead.
Keep in mind that doing this in a transform can be expensive resource-wise: mappers should ideally be quick and lightweight, since they are executed once per row.
Here is how you can do it, at least in Glue 3.0. The log events will end up in the error logs.
def transform(record):
    logging.basicConfig(level=logging.INFO, format="MAPPER %(asctime)s [%(levelname)s] [%(name)s] %(message)s")
    map_logger = logging.getLogger()
    map_logger.info("an info event")
    map_logger.error("an error event")
    return record
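To soften that per-row cost: logging.basicConfig is effectively a no-op once the root logger already has handlers, so repeated calls are cheaper than they look. If you'd rather make the one-time setup explicit, you can cache the logger in a small helper. This is just a sketch, and get_map_logger is my own name, not a Glue API:

import functools
import logging

@functools.lru_cache(maxsize=None)
def get_map_logger():
    # First call configures logging on the executor process;
    # later calls return the cached logger without touching basicConfig again
    logging.basicConfig(level=logging.INFO, format="MAPPER %(asctime)s [%(levelname)s] [%(name)s] %(message)s")
    return logging.getLogger("transformer")

def transform(record):
    get_map_logger().info("an info event")
    return record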
Here's a full example script:
import logging

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import Map
from pyspark.context import SparkContext
from pyspark.sql.types import IntegerType

# Configure logging for the driver
logging.basicConfig(level=logging.INFO, format='%(asctime)s [%(levelname)s] [%(name)s] %(message)s')
logger = logging.getLogger(__name__)

def main():
    logger.info("======== Job Initialisation ==========")
    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark_session = glue_context.spark_session

    logger.info("======== Start of ETL ==========")
    df = spark_session.createDataFrame(range(1, 100), IntegerType())
    dynf = DynamicFrame.fromDF(df, glue_context, "dynf")

    # Apply the mapper function on each row
    dynf = Map.apply(frame=dynf, f=transform)
    dynf.show(10)  # show() prints the rows and returns None, so don't log its return value
    logger.info("Done")

def transform(record):
    # Configure logging on the executor; basicConfig is a no-op if handlers already exist
    logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] [%(name)s] %(message)s")
    map_logger = logging.getLogger("transformer")
    map_logger.info("an info event")
    map_logger.error("an error event")
    return record

main()
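If the per-row setup still bothers you, another option is to skip Glue's Map and drop down to Spark's mapPartitions, so the logger is configured once per partition rather than once per row. This is only a sketch, assuming dynf and glue_context are the objects from the script above:

import logging

from awsglue.dynamicframe import DynamicFrame

def transform_partition(rows):
    # Runs once per partition on the executor, so the logging
    # setup cost is paid per partition, not per row
    logging.basicConfig(level=logging.INFO, format="MAPPER %(asctime)s [%(levelname)s] [%(name)s] %(message)s")
    part_logger = logging.getLogger("transformer")
    part_logger.info("partition started")
    for row in rows:
        # per-row work goes here
        yield row

# dynf and glue_context come from the script above
mapped = dynf.toDF().rdd.mapPartitions(transform_partition)
dynf = DynamicFrame.fromDF(mapped.toDF(), glue_context, "dynf")

The trade-off is that you leave the DynamicFrame API for that step and rely on Spark's schema inference when converting back.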