I'm very new to this, so I'm not sure if this script could be simplified or if I'm doing something wrong that's causing this behavior. I've written an ETL script for AWS Glue that writes to a directory within an S3 bucket.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# catalog: database and table names
db_name = "events"
tbl_base_event_info = "base_event_info"
tbl_event_details = "event_details"
# output directories
output_dir = "s3://whatever/output"
# create dynamic frames from source tables
base_event_source = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = tbl_base_event_info)
event_details_source = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = tbl_event_details)
# join frames
base_event_source_df = base_event_source.toDF()
event_details_source_df = event_details_source.toDF()
enriched_event_df = base_event_source_df.join(event_details_source_df, "event_id")
enriched_event = DynamicFrame.fromDF(enriched_event_df, glueContext, "enriched_event")
# write frame to json files
datasink = glueContext.write_dynamic_frame.from_options(frame = enriched_event, connection_type = "s3", connection_options = {"path": output_dir}, format = "json")
job.commit()
The base_event_info table has 4 columns: event_id, event_name, platform, client_info. The event_details table has 2 columns: event_id, event_details. The joined table schema should look like: event_id, event_name, platform, client_info, event_details.
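To double-check the join before writing, I can inspect the joined DataFrame with the standard Spark DataFrame calls (printSchema and count), something like:

enriched_event_df.printSchema()   # should show the 5 columns listed above
print(enriched_event_df.count())  # should print 2

So the join itself appears to produce what I expect; the issue is only with the files written to S3.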
After running this job, I expected to get 2 JSON files, since that's how many records are in the resulting joined table (there are two records in the tables with the same event_id). However, what I get is about 200 files named run-1540321737719-part-r-00000, run-1540321737719-part-r-00001, etc.:
- 198 files contain 0 bytes
- 2 files contain 250 bytes (each with the correct info corresponding to the enriched events)
Is this the expected behavior? Why is this job generating so many empty files? Is there something wrong with my script?
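For what it's worth, the only workaround I can think of is to coalesce the joined DataFrame down to a single partition before converting it back to a DynamicFrame, roughly like this (just a sketch using the standard DataFrame coalesce; I'm not sure whether coalesce or repartition is the more appropriate fix here):

# collapse to a single partition so only one output file is written
enriched_event_df = base_event_source_df.join(event_details_source_df, "event_id").coalesce(1)
enriched_event = DynamicFrame.fromDF(enriched_event_df, glueContext, "enriched_event")
datasink = glueContext.write_dynamic_frame.from_options(frame = enriched_event, connection_type = "s3", connection_options = {"path": output_dir}, format = "json")

But I'd still like to understand why so many empty files are produced in the first place.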