Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

  1. A data catalog (implementing functionality of a Hive Metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS including Amazon RDS and Amazon Redshift
  2. Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
  3. A distributed data processing framework which extends PySpark with functionality for increased schema flexibility
  4. Code generation tools to template and bootstrap data processing scripts
  5. Scheduling for crawlers and data processing scripts
  6. Serverless development and execution of scripts in an Apache Spark (2.x) environment

Data registered in the AWS Glue Data Catalog is available to many AWS services, including:

  • Amazon Redshift Spectrum
  • EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
  • Amazon Athena
  • AWS Glue scripts
4003 questions
45
votes
8 answers

AWS Glue Crawler Not Creating Table

I have a crawler I created in AWS Glue that does not create a table in the Data Catalog after it successfully completes. The crawler takes roughly 20 seconds to run and the logs show it successfully completed. CloudWatch log shows: Benchmark:…
Vince
  • 593
  • 1
  • 5
  • 10
42
votes
9 answers

Can I test AWS Glue code locally?

After reading Amazon docs, my understanding is that the only way to run/test a Glue script is to deploy it to a dev endpoint and debug remotely if necessary. At the same time, if the (Python) code consists of multiple files and packages, all except…
lfk
  • 2,423
  • 6
  • 29
  • 46
41
votes
4 answers

How to create AWS Glue table where partitions have different columns? ('HIVE_PARTITION_SCHEMA_MISMATCH')

As per this AWS Forum Thread, does anyone know how to use AWS Glue to create an AWS Athena table whose partitions contain different schemas (in this case different subsets of columns from the table schema)? At the moment, when I run the crawler over…
rjmurt
  • 1,135
  • 2
  • 9
  • 25
39
votes
7 answers

How do I write messages to the output log on AWS Glue?

AWS Glue jobs log output and errors to two different CloudWatch logs, /aws-glue/jobs/error and /aws-glue/jobs/output by default. When I include print() statements in my scripts for debugging, they get written to the error log (/aws-glue/jobs/error).…
Jesse Clark
  • 1,150
  • 2
  • 13
  • 15
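
A minimal sketch related to the logging question above. Python's `logging.StreamHandler` writes to stderr unless given another stream, so this example attaches a handler bound explicitly to stdout. It uses only the standard library; the helper name `make_stdout_logger` and the logger name are hypothetical, and whether switching streams changes the CloudWatch log group depends on how Glue routes each stream, which the excerpt does not confirm.

```python
import logging
import sys


def make_stdout_logger(name, stream=None):
    """Return a logger whose records go to `stream` (stdout by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Do not pass records up to the root logger, which could print them twice.
    logger.propagate = False
    # Drop handlers left over from an earlier call with the same name.
    logger.handlers.clear()
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = make_stdout_logger("glue_job_debug")  # hypothetical logger name
    log.info("row count after filter: %d", 42)
```

Passing a different `stream` (for example an in-memory buffer) makes the helper easy to check outside of a job run.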
37
votes
4 answers

DynamicFrame vs DataFrame

What is the difference? I know that DynamicFrame was created for AWS Glue, but AWS Glue also supports DataFrame. When should DynamicFrame be used in AWS Glue?
Alex Oh
  • 411
  • 1
  • 4
  • 6
31
votes
3 answers

What is transformation_ctx used for in AWS Glue?

Many methods in the API accept this parameter with a default value of "". Is it just a string marker, and if so, what is its purpose?
Cherry
  • 31,309
  • 66
  • 224
  • 364
30
votes
6 answers

AWS Glue to Redshift: Is it possible to replace, update or delete data?

Here are some bullet points on how I have things set up: I have CSV files uploaded to S3 and a Glue crawler set up to create the table and schema. I have a Glue job set up that writes the data from the Glue table to our Amazon Redshift…
krchun
  • 994
  • 1
  • 9
  • 19
28
votes
4 answers

Is AWS Lambda preferred over AWS Glue Job?

In an AWS Glue job, we can write a script and execute it via the job. In AWS Lambda too, we can write the same script and execute the same logic as in the job above. So, my query is not what's the difference between AWS Glue Job vs AWS Lambda,…
john
  • 925
  • 1
  • 12
  • 20
26
votes
1 answer

AWS Glue Job Input Parameters

I am relatively new to AWS and this may be a somewhat less technical question, but at present AWS Glue notes a maximum of 25 jobs permitted to be created. We are loading in a series of tables that each have their own job that subsequently appends audit…
Sauron
  • 6,399
  • 14
  • 71
  • 136
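
A minimal sketch for the job-parameter question above: one script that takes the table name as an argument, rather than one job per table. Inside a real Glue job the usual helper is `getResolvedOptions` from `awsglue.utils`; this version uses the standard-library `argparse` so it runs anywhere. The argument names `--table_name` and `--target_schema` and the default `staging` are assumptions for illustration, not taken from the question.

```python
import argparse


def parse_job_args(argv):
    """Parse job arguments passed in `--key value` form and return a dict."""
    parser = argparse.ArgumentParser(
        description="Parameterized load job (illustrative)",
        allow_abbrev=False,  # require full argument names
    )
    parser.add_argument("--table_name", required=True)        # hypothetical
    parser.add_argument("--target_schema", default="staging")  # hypothetical
    # parse_known_args tolerates extra arguments that the runtime may inject.
    args, _unknown = parser.parse_known_args(argv)
    return vars(args)


if __name__ == "__main__":
    # Simulated argument list; a real job would read sys.argv[1:].
    print(parse_job_args(["--table_name", "orders", "--JOB_NAME", "load_tables"]))
```

With this shape, the same script can be started once per table with a different `--table_name` value.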
25
votes
6 answers

Could not find S3 endpoint or NAT gateway for subnetId

I am unable to connect AWS Glue with RDS. VPC S3 endpoint validation failed for SubnetId: subnet-7e8a2. VPC: vpc-4d2d25. Reason: Could not find S3 endpoint or NAT gateway for subnetId: subnet-7ea32 in Vpc vpc-4d225.
24
votes
5 answers

AWS Glue: How to handle nested JSON with varying schemas

Objective: We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse via Redshift Spectrum. Background: The JSON data is from DynamoDB Streams and is deeply…
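
To make the "varying schemas" problem above concrete, here is a minimal plain-Python sketch that flattens nested records into dotted column names and then takes the union of columns across records. It is not Glue's own mechanism and does not touch the Data Catalog; the function names and sample field names are hypothetical.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level with dotted keys.

    Lists and scalar values are kept as-is; only dicts are descended into.
    """
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def union_columns(records):
    """Return the sorted union of flattened column names over all records."""
    columns = set()
    for record in records:
        columns.update(flatten(record))
    return sorted(columns)


if __name__ == "__main__":
    # Two hypothetical records whose nested fields differ.
    records = [
        {"id": 1, "payload": {"name": "a", "address": {"city": "x"}}},
        {"id": 2, "payload": {"name": "b", "tags": ["t1", "t2"]}},
    ]
    print(union_columns(records))
```

A single table over such data needs every column in that union, with missing values treated as null for records that lack a field.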
24
votes
3 answers

What actions does job.commit perform in AWS Glue?

Every job script should end with job.commit(), but what exactly does this function do? Is it just a marker for the end of the job? Can it be called twice during one job (and if so, in which cases)? Is it safe to execute any Python statement after…
Cherry
  • 31,309
  • 66
  • 224
  • 364
23
votes
3 answers

Overwrite parquet files from dynamic frame in AWS Glue

I use dynamic frames to write a Parquet file to S3, but if a file already exists my program appends a new file instead of replacing it. The statement that I use is this: glueContext.write_dynamic_frame.from_options(frame = table, …
Mateo Rod
  • 544
  • 2
  • 6
  • 14
22
votes
4 answers

At least one security group must open all ingress ports. AWS Glue connecting to RDS

I am still starting out with AWS Glue and I am trying to connect it to my publicly accessible MySQL database hosted on RDS Aurora to get its data. So I start by creating a crawler and in the data store I create a new connection as in the screenshot…
Naguib Ihab
  • 4,259
  • 7
  • 44
  • 80
22
votes
6 answers

Can we consider AWS Glue as a replacement for EMR?

Just a quick question for the experts: since AWS Glue, as an ETL tool, can provide companies with benefits such as minimal or no server maintenance and cost savings by avoiding over-provisioning or under-provisioning resources, besides running…
Yuva
  • 2,831
  • 7
  • 36
  • 60