TL;DR
- I'm trying to consolidate many small S3 data files into fewer, larger ones using a Glue [Studio] job
- Input data is Catalogued in Glue and queryable via Athena
- The Glue job finishes with a "Succeeded" status, but no output files are created
Details
Input: I have data that's being created by a scraper on a once-per-minute cycle. It dumps its output in JSON (gzip) format to a bucket. I have this bucket catalogued in Glue and can query it with Athena without errors, which makes me more confident that the Catalogue and data structure are set up correctly. On its own this isn't ideal: it creates ~1.4K files per day (one per minute, so 1,440), which makes Athena queries against the data quite slow because they have to scan far too many, far too small files.
Goal: I'd like to periodically (probably once per week or once per month; I'm not sure yet) consolidate the once-per-minute files into far fewer, larger files, so that queries scan fewer and bigger files and run faster.
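To make the goal concrete, here is a minimal local sketch (plain Python, not Glue) of the kind of consolidation I mean. It assumes each input file is gzip-compressed JSON with one record per line, which may not match every setup; the function name, the `*.json.gz` pattern, the `part-NNNNN` output naming, and the `lines_per_file` threshold are all made up for illustration.

```python
import gzip
from pathlib import Path


def consolidate_gzip_json(input_dir, output_dir, lines_per_file=100_000):
    """Merge many small .json.gz files (one JSON record per line) into fewer, larger ones."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    written = []  # paths of the consolidated files created so far
    out = None    # currently open output file, if any
    count = 0     # records written to the current output file

    for src in sorted(input_dir.glob("*.json.gz")):
        with gzip.open(src, "rt", encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                if out is None:
                    # Start a new consolidated file (hypothetical naming scheme).
                    path = output_dir / f"part-{len(written):05d}.json.gz"
                    out = gzip.open(path, "wt", encoding="utf-8")
                    written.append(path)
                out.write(line + "\n")
                count += 1
                if count >= lines_per_file:
                    # Current file is full; the next record opens a new one.
                    out.close()
                    out, count = None, 0

    if out is not None:
        out.close()
    return written
```

With one day of input (1,440 small files), a call such as `consolidate_gzip_json("in/", "out/")` would produce only as many output files as the record count requires. This is what I'm hoping the Glue job will do for me against S3.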
Approach: My plan is to create a Glue ETL job (using Glue Studio) that reads from the Catalogue table and writes to a new S3 location, keeping the same JSON-gzip format so that I can simply re-point the Glue table at the new S3 location with the consolidated files. I set up the job using Glue Studio, and when I run it, it says it succeeded, but there's no output in the specified S3 location (not empty files, just nothing at all).
Stuck! I'm at a bit of a loss, since (1) the job reports success, and (2) I'm not even modifying the generated script (see below), so I'd presume (maybe wrongly) that the script isn't the problem.
Logs: I've tried going through the CloudWatch logs, but I don't get much out of them. I suspect the problem may have something to do with the entry below, but I can't find a way to either confirm that or change anything to "fix" it. (The path definitely exists: I can see it in S3, the Catalogue can read it, as verified by Athena queries, and it's auto-generated by the Glue Studio script builder.) To me it sounds like I've selected, somewhere, an option that makes the job think I only want some sort of "incremental" scan of the data. But I haven't (knowingly), nor can I find any setting that suggests I have.
CloudWatch Log Entry
21/03/13 17:59:39 WARN HadoopDataSource: Skipping Partition {} as no new files detected @ s3://my_bucket/my_folder/my_source_data/ or path does not exist
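One thing I could check, assuming (and this is only my guess) that the "no new files detected" wording is related to Glue job bookmarks: whether the job is being started with the `--job-bookmark-option` parameter set to something other than `job-bookmark-disable`. The helper below is a hypothetical sketch that just scans an argument list for that flag; I'm assuming the option shows up in `sys.argv` the same way `--JOB_NAME` does.

```python
import sys


def bookmark_option(argv):
    """Return the value given for --job-bookmark-option in argv, or None if absent."""
    flag = "--job-bookmark-option"
    for i, arg in enumerate(argv):
        # Form 1: "--job-bookmark-option job-bookmark-enable"
        if arg == flag and i + 1 < len(argv):
            return argv[i + 1]
        # Form 2: "--job-bookmark-option=job-bookmark-enable"
        if arg.startswith(flag + "="):
            return arg.split("=", 1)[1]
    return None


if __name__ == "__main__":
    # Printed to the job's output log so it can be read back in CloudWatch.
    print("job bookmark option:", bookmark_option(sys.argv))
```

If this printed `job-bookmark-enable`, that would at least be consistent with my "incremental scan" suspicion, although it would not prove that bookmarks are the cause.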
Glue Script
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "my_database", table_name = "my_table", transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_database", table_name = "my_table", transformation_ctx = "DataSource0")
## @type: DataSink
## @args: [connection_type = "s3", format = "json", connection_options = {"path": "s3://my_bucket/my_folder/consolidation/", "compression": "gzip", "partitionKeys": []}, transformation_ctx = "DataSink0"]
## @return: DataSink0
## @inputs: [frame = DataSource0]
DataSink0 = glueContext.write_dynamic_frame.from_options(frame = DataSource0, connection_type = "s3", format = "json", connection_options = {"path": "s3://my_bucket/my_folder/consolidation/", "compression": "gzip", "partitionKeys": []}, transformation_ctx = "DataSink0")
job.commit()
Other Posts I Researched First
None of these has the same problem of a "Succeeded" job producing no output. One had empty files being created, and another had too many files. The most interesting approach was using Athena to create the new output file for you (with an external table); however, when I looked into that, it appeared that the output format options did not include JSON-gzip (or JSON without gzip), only CSV and Parquet, which I'd prefer not to use.
How to Convert Many CSV files to Parquet using AWS Glue
AWS Glue: ETL job creates many empty output files