I have an AWS Glue job that is meant to generate a monthly report.

My source data is stored in S3 in Parquet format, and I validate the output by querying it in Athena.

The current issue is that when the Glue job runs on the following day, it aggregates only that one day of data.

adv_amt |adv_fee|adv_txn|adv_uniq_user|credit_date
300.0   |60.0   |6      |2            | 2023-05-01
260.0   |52.0   |6      |3            | 2023-04-01
170.0   |34.0   |5      |3            | 2023-03-01

My target is to run the job daily and have it update the row for the current month.

For example, if today is 2023-05-19:

  1. My current result is aggregated from 2023-05-01 to 2023-05-19 and reported as 2023-05-01.
  2. If I run the job on the following day, the result should be aggregated from 2023-05-01 to 2023-05-20 and still reported as 2023-05-01.

Is there a way to get my expected result?
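
To make the expected behaviour concrete, here is a minimal plain-Python sketch of the month-to-date aggregation described above. The output keys mirror the report columns; the input field names (`txn_date`, `user_id`, `amt`, `fee`), the sample rows, and the function name `month_to_date_report` are assumptions made up for illustration, not the real schema.

```python
from datetime import date


def month_to_date_report(rows, run_date):
    """Aggregate rows dated from the 1st of run_date's month through run_date."""
    month_start = run_date.replace(day=1)
    # Keep only rows inside the month-to-date window.
    in_window = [r for r in rows if month_start <= r["txn_date"] <= run_date]
    return {
        "adv_amt": sum(r["amt"] for r in in_window),
        "adv_fee": sum(r["fee"] for r in in_window),
        "adv_txn": len(in_window),
        "adv_uniq_user": len({r["user_id"] for r in in_window}),
        # The whole window is reported under the first day of the month.
        "credit_date": month_start,
    }


# Hypothetical daily rows, for illustration only.
rows = [
    {"txn_date": date(2023, 4, 30), "user_id": "u3", "amt": 40.0, "fee": 8.0},
    {"txn_date": date(2023, 5, 1), "user_id": "u1", "amt": 50.0, "fee": 10.0},
    {"txn_date": date(2023, 5, 19), "user_id": "u2", "amt": 50.0, "fee": 10.0},
    {"txn_date": date(2023, 5, 20), "user_id": "u1", "amt": 100.0, "fee": 20.0},
]

run_19 = month_to_date_report(rows, date(2023, 5, 19))
run_20 = month_to_date_report(rows, date(2023, 5, 20))
```

Both runs carry `credit_date` 2023-05-01; the second run simply covers one more day, so it replaces the first run's row rather than adding a new one.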

c0ng111
  • Can't you just read all parquet files of the current month? – lsc May 19 '23 at 14:54
  • Structure your directory structure like `s3://bucket-name/table-name/year=2023/month=05/*.parquet`. I assume you are talking about Glue ETL job which is based on Spark. – lsc May 19 '23 at 16:17
  • Sorry, but I'm new to AWS services. What do you mean by reading all Parquet files for the current month? From what I read, this has something to do with job bookmarks: Glue examines the data to determine whether it contains bookmarkable data, which typically includes columns with a monotonically increasing value such as a timestamp or an incrementing ID. How can I reset the bookmark for just the current month? – c0ng111 May 19 '23 at 16:19
  • "Structure your directory structure like s3://bucket-name/table-name/year=2023/month=05/*.parquet. I assume you are talking about Glue ETL job which is based on Spark" --> Is this for the source directory? Our current source directory is s3://bucket-name/table-name/year=2023/month=05/day=01/*.parquet – c0ng111 May 19 '23 at 16:21
  • Disable the job bookmark; then create a DynamicFrame (or Spark DataFrame) from `s3://bucket-name/table-name/year=2023/month=05/` – lsc May 22 '23 at 13:30
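
The suggestion in the last comment could be sketched roughly as follows. This is only an illustration under assumptions: the helper names `current_month_prefix` and `read_current_month` are hypothetical, the `spark` argument is an existing SparkSession (for example the one a Glue job already provides), and the job is assumed to run with bookmarks turned off, which as far as I know is done with the job parameter `--job-bookmark-option` set to `job-bookmark-disable`.

```python
from datetime import date


def current_month_prefix(table_root: str, run_date: date) -> str:
    """Build the S3 prefix that covers every day=... folder of run_date's month."""
    return f"{table_root.rstrip('/')}/year={run_date:%Y}/month={run_date:%m}/"


def read_current_month(spark, table_root: str, run_date: date):
    """Read all Parquet files of the current month as a Spark DataFrame.

    `spark` is an existing SparkSession (assumed to be supplied by the job).
    Setting basePath to the table root keeps year/month/day available as
    partition columns even though only one month's prefix is read.
    """
    prefix = current_month_prefix(table_root, run_date)
    return spark.read.option("basePath", table_root).parquet(prefix)


# Example with the bucket layout mentioned in the comments.
prefix = current_month_prefix("s3://bucket-name/table-name", date(2023, 5, 19))
```

Because the whole month prefix is re-read on every run, each daily run recomputes the month-to-date totals from scratch and can overwrite the current month's row in the report, instead of relying on bookmarks to pick up only new files.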

0 Answers