
TL;DR

  • I'm trying to consolidate many S3 data files into far fewer using a Glue [Studio] job
  • Input data is Catalogued in Glue and queryable via Athena
  • Glue Job runs with "Succeeded" output status, but no output files are created

Details

Input: I have data being created by a scraper on a once-per-minute cycle. It dumps its output in JSON (gzip) format to an S3 bucket. I have this bucket catalogued in Glue and can query it, without errors, using Athena, which gives me some confidence that the Catalogue and data structure are set up correctly. On its own, though, this isn't ideal: it creates ~1.4K files per day, which makes Athena queries against the data quite slow, since they have to scan far too many, far too small files.
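For reference, the kind of sanity check I mean looks roughly like this (a minimal sketch using boto3; the database/table names match the job below and the query-results location is a placeholder):

import boto3

# Sketch only: run a simple count against the catalogued table to confirm
# Athena can read it. "my_database"/"my_table" match the Glue job below;
# the query-results location is a placeholder.
athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM my_table",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my_bucket/athena-query-results/"},
)
print(response["QueryExecutionId"])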

Goal: I'd like to periodically (probably once per week or month; I'm not sure yet) consolidate the once-per-minute files into far fewer, so that queries scan fewer, larger files and run faster.
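When I get to the scheduling part, I expect it to look something like this (a sketch, assuming a scheduled Glue trigger on a weekly cron; the job and trigger names are placeholders):

import boto3

# Sketch only: a scheduled Glue trigger that runs the consolidation job weekly.
# "my_consolidation_job" is a placeholder name.
glue = boto3.client("glue")
glue.create_trigger(
    Name="weekly-consolidation-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 3 ? * MON *)",  # every Monday at 03:00 UTC
    Actions=[{"JobName": "my_consolidation_job"}],
    StartOnCreation=True,
)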

Approach: My plan is to create a Glue ETL job (using Glue Studio) that reads from the Catalogue table and writes to a new S3 location, keeping the same JSON-gzip format so I can simply re-point the Glue table at the new location containing the consolidated files. I set up the job in Glue Studio, and when I run it it reports "Succeeded", but there's no output at the S3 location specified (not empty files, just nothing at all).
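The "re-point the Glue table" step I have in mind would be something like this (a boto3 sketch; only fields that TableInput accepts are copied over, and all names are placeholders matching the job below):

import boto3

glue = boto3.client("glue")

# Sketch only: fetch the current table definition and point its location
# at the folder holding the consolidated files.
table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]

# get_table returns more fields than update_table accepts, so copy only
# the relevant ones into a TableInput and swap the S3 location.
table_input = {
    "Name": table["Name"],
    "StorageDescriptor": {
        **table["StorageDescriptor"],
        "Location": "s3://my_bucket/my_folder/consolidation/",
    },
    "PartitionKeys": table.get("PartitionKeys", []),
    "TableType": table.get("TableType", "EXTERNAL_TABLE"),
    "Parameters": table.get("Parameters", {}),
}
glue.update_table(DatabaseName="my_database", TableInput=table_input)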

Stuck! I'm at a bit of a loss, since (1) the job says it's succeeding, and (2) I'm not even modifying the auto-generated script (see below), so I'd presume (maybe a bad idea) that the script itself isn't the problem.

Logs: I've tried going through the CloudWatch logs to see if they'd help, but I don't get much out of them. I suspect it may have something to do with the entry below, but I can't find a way to either confirm that or change anything to "fix" it. (The path definitely exists: I can see it in S3, the Catalogue can search it, as verified by the Athena queries, and the path is auto-generated by the Glue Studio script builder.) To me it sounds like I've selected, somewhere, an option that makes the job do only some sort of "incremental" scan of the data. But I haven't done so knowingly, nor can I find any setting suggesting that I have; my best guess is sketched out after the log entry below.

CloudWatch Log Entry

21/03/13 17:59:39 WARN HadoopDataSource: Skipping Partition {} as no new files detected @ s3://my_bucket/my_folder/my_source_data/ or path does not exist
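This is only a guess, but the warning reads like Glue's job-bookmark behaviour (with transformation_ctx set and bookmarks enabled, only files added since the last successful run are read, so a re-run over already-seen data writes nothing). If that is what's happening, something like the following should either reset the bookmark or run the job once with bookmarks disabled (the job name is a placeholder):

import boto3

glue = boto3.client("glue")

# Guess/sketch only: clear the job bookmark so the next run re-reads
# everything under the source path. "my_consolidation_job" is a placeholder.
glue.reset_job_bookmark(JobName="my_consolidation_job")

# Or start a single run with bookmarks explicitly disabled for that run.
glue.start_job_run(
    JobName="my_consolidation_job",
    Arguments={"--job-bookmark-option": "job-bookmark-disable"},
)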

Glue Script

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "my_database", table_name = "my_table", transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_database", table_name = "my_table", transformation_ctx = "DataSource0")
## @type: DataSink
## @args: [connection_type = "s3", format = "json", connection_options = {"path": "s3://my_bucket/my_folder/consolidation/", "compression": "gzip", "partitionKeys": []}, transformation_ctx = "DataSink0"]
## @return: DataSink0
## @inputs: [frame = DataSource0]
DataSink0 = glueContext.write_dynamic_frame.from_options(frame = DataSource0, connection_type = "s3", format = "json", connection_options = {"path": "s3://my_bucket/my_folder/consolidation/", "compression": "gzip", "partitionKeys": []}, transformation_ctx = "DataSink0")
job.commit()

Other Posts I Researched First

None of these have the same problem of a "Succeeded" job producing no output: one had empty files being created, and another produced too many files. The most interesting approach was using Athena to create the new output for you (with an external table); however, when I looked into that, it appeared the output format options did not include JSON-gzip (or JSON without gzip), only CSV and Parquet, neither of which I prefer for my use case.

How to Convert Many CSV files to Parquet using AWS Glue

AWS Glue: ETL job creates many empty output files

AWS Glue Job - Writing into single Parquet file

AWS Glue, output one file with partitions

Matt

1 Answer


# Repartition the DynamicFrame to a single partition so the sink writes one
# consolidated output file instead of one file per Spark partition.
datasource_df = DataSource0.repartition(1)

DataSink0 = glueContext.write_dynamic_frame.from_options(frame = datasource_df, connection_type = "s3", format = "json", connection_options = {"path": "s3://my_bucket/my_folder/consolidation/", "compression": "gzip", "partitionKeys": []}, transformation_ctx = "DataSink0")

job.commit()

Guest
    Please provide a detailed explanation of your answer, so that the next user can understand it better. Also, use the right formatting for the code and for the text, both of which you can find above the input field of the answer once you start editing it – Elydasian Jul 28 '21 at 12:28