
The AWS Glue docs clearly state that crawlers scrape metadata from the source (JDBC or S3) and populate the Data Catalog (creating/updating a database and the corresponding tables).

However, it's not clear whether we need to run a crawler regularly to detect new data in a source (i.e., new objects in S3, new rows in a DB table) if we know there are no schema/partitioning changes.

So, is it required to run a crawler prior to running an ETL job in order to pick up new data?

Yuriy Bondaruk

2 Answers


AWS Glue will automatically detect new data in S3 buckets so long as it's within your existing folders (partitions).

If data is added to new folders (partitions), you need to reload your partitions using MSCK REPAIR TABLE mytable;.

RobinL
  • It means that if data is partitioned by date (year/month/day) and arrives continuously, then before running a Glue job I have to run `MSCK REPAIR TABLE mytable;` at least once a day to pick up the latest data from a new 'day' folder. Is there a way to automate running the command (a trigger)? Or to invoke it from a Glue job script before processing? – Yuriy Bondaruk Apr 15 '18 at 13:24
  • I think probably a scheduled lambda may be the easiest way - see [here](https://stackoverflow.com/questions/47546670/how-to-make-msck-repair-table-execute-automatically-in-aws-athena). Another possibility is that you can use boto3 within the job running in Glue, so you should be able to use [this](http://boto3.readthedocs.io/en/latest/reference/services/athena.html#Athena.Client.start_query_execution) to execute the `MSCK REPAIR TABLE` command. – RobinL Apr 15 '18 at 13:38
  • Thanks a lot, it's very helpful! However, I started thinking about reading directly from S3 instead of the Data Catalog in the Glue job, so that I always have the latest data without needing to run additional commands. – Yuriy Bondaruk Apr 15 '18 at 13:56
  • Yes, that's another good approach. For raw/intermediate data we also often just use spark.read.csv or similar rather than bothering to load it into the catalog. – RobinL Apr 15 '18 at 15:13
  • Very useful answer @RobinL. Does that mean that if, for example, a crawler points to a folder (say, a parent folder) that has multiple subfolders (say, abc, xyz, etc.), and we then add another subfolder (say, pqr), the file inside the pqr subfolder would be detected automatically (i.e., without running the crawler again)? Thanks – pc_pyr Sep 11 '20 at 15:48
  • @pc_pyr Yes, so long as you run `MSCK REPAIR TABLE mytable;` to detect the new folder, and the schema of the file inside `pqr` is the same as the schema of the files inside `abc` and `xyz`. – RobinL Sep 12 '20 at 10:09
  • You can also schedule the Glue crawler: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-crawler-schedule.html – Vincent Claes Feb 03 '21 at 10:09
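As the comments suggest, the repair statement can be submitted from a scheduled Lambda or from the Glue job itself via boto3's Athena client. A minimal sketch, assuming the database, table, and results-bucket names below are purely illustrative:

```python
def msck_repair_sql(table: str) -> str:
    """Build the statement that registers new partition folders."""
    return f"MSCK REPAIR TABLE {table};"

def submit_repair(database: str, table: str, output_s3: str) -> str:
    """Submit the repair statement to Athena; returns the query execution id."""
    import boto3  # imported here so msck_repair_sql stays usable without AWS deps
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=msck_repair_sql(table),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]

# Example (illustrative names; requires AWS credentials):
# submit_repair("my_db", "mytable", "s3://my-athena-results/")
```

Athena runs the repair asynchronously; if the Glue job must wait for the new partitions, poll `get_query_execution` with the returned id before reading the table.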

It's necessary to run the crawler prior to the job.

The crawler replaces Athena's `MSCK REPAIR TABLE` and also updates the table with new columns as they're added.
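If the crawler does need to run before each job, a scheduled Glue trigger can handle it, or the orchestration code can start the crawler and block until it finishes. A minimal boto3 sketch, assuming the crawler name is illustrative (Glue reports state `READY` once a crawler is idle):

```python
import time

def crawler_is_idle(state: str) -> bool:
    """Glue reports READY when a crawler has finished (or has not started)."""
    return state == "READY"

def run_crawler_and_wait(name: str, poll_seconds: int = 15) -> None:
    """Start the named crawler and poll until it returns to READY."""
    import boto3  # imported here so crawler_is_idle stays importable without AWS deps
    glue = boto3.client("glue")
    glue.start_crawler(Name=name)
    while True:
        state = glue.get_crawler(Name=name)["Crawler"]["State"]
        if crawler_is_idle(state):
            return
        time.sleep(poll_seconds)

# Example (illustrative name; requires AWS credentials):
# run_crawler_and_wait("my-etl-crawler")
```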

Ricardo Mayerhofer