22

Just a quick question for the experts: AWS Glue, as an ETL tool, offers benefits such as minimal or no server maintenance and cost savings from avoiding over-provisioning or under-provisioning of resources, and it runs on Spark. Given that, can AWS Glue replace EMR?

If both can co-exist, what role can EMR play alongside AWS Glue?

Thanks & regards

Yuva

Yuva
  • Glue is good for ETL work. If you're only using EMR to run ETL jobs, then Glue would be a great replacement. However, you can also use EMR to run custom algorithms, machine learning, etc. on your data. You can't do that with Glue, so think of EMR as a more complex but much more flexible service. – niczky12 Jan 12 '18 at 09:11

6 Answers

10

As per my understanding, Glue cannot be a replacement for EMR. It really depends on your use case. There are some limitations with Glue ETL:

  • It does not support --packages.
  • You do not have internal storage for temporary data.

With the Glue catalog you can view data in Athena, but Athena also has a few limitations: for example, it cannot create a table as select and cannot create views. You can use the Glue Data Catalog in EMR to overcome these Athena limitations.

So, currently, Glue can serve as a replacement for a persistent metadata store.
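To illustrate the "use the Glue Data Catalog in EMR" part, here is a minimal sketch of the cluster configuration involved. It assumes the `hive-site` and `spark-hive-site` classifications and the Glue client factory class described in the EMR documentation; the helper name `glue_catalog_configurations` is made up for this example, and the resulting list is what you would pass as the cluster's `Configurations` when creating it.

```python
import json

# Assumption: factory class name as given in the EMR docs for the Glue Data Catalog.
GLUE_CATALOG_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations(include_spark=True):
    """Build an EMR `Configurations` list that points the metastore at the Glue Data Catalog."""
    props = {"hive.metastore.client.factory.class": GLUE_CATALOG_FACTORY}
    # Hive reads the metastore client factory from the hive-site classification.
    configs = [{"Classification": "hive-site", "Properties": dict(props)}]
    if include_spark:
        # Spark SQL picks up the same setting from spark-hive-site.
        configs.append({"Classification": "spark-hive-site", "Properties": dict(props)})
    return configs


if __name__ == "__main__":
    # Print the JSON you could save to a file and hand to the cluster-creation call.
    print(json.dumps(glue_catalog_configurations(), indent=2))
```

With this in place, tables defined by Glue crawlers or jobs should be visible to Hive and Spark SQL on the cluster, provided the cluster's IAM role is allowed to read the catalog.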

Ashutosh
  • Thank you for your views, yalcinmn1 & ashutoshs. I am evaluating AWS Glue's capabilities against those of EMR. So far, I have been able to set up a local Zeppelin instance, connect it to AWS Glue to run my ETL code, and finally store the data in an AWS Redshift cluster using a JDBC connection. Still working on the evaluation. Thanks – Yuva Jan 19 '18 at 17:07
  • I have an issue storing the result of a query in Redshift. If I could get in touch with you to ask about the connection, that would be great. Thanks – Andres Urrego Angel May 11 '18 at 20:44
  • AWS Glue is not a direct replacement for EMR. They serve different purposes. 1) Glue is only for ETL and for business use cases with transient data. EMR can be used for operations other than ETL, like ML, data storage in Hive, Presto, Zeppelin, etc. 2) Glue is costlier than EMR due to its serverless nature, but EMR may need to be operational 24 x 7, so the comparison differs case by case. And there are many more... :) – abhijitcaps Aug 22 '22 at 14:48
7

AWS Glue does not let us configure a lot of things, such as executor memory or driver memory. It is a fully managed service with 5 GB as the default driver memory and 5 GB as the default executor memory. AWS EMR, on the other hand, is not a fully managed service, so we have to configure it ourselves. That makes it better suited to experienced engineers.

prabhugs
3

In my experience so far, Glue has not provided any significant advantages over EMR. I also ran into a couple of limitations in Glue, such as library support and temporary storage. In addition, although Glue sits on top of Spark, it does not behave the same as core Spark: for example, when reading 1-row CSV files, or when it ignores an entire file if the header is missing.

One thing I am still investigating is whether Glue dynamically adjusts the cluster based on query load. If I cannot find a good answer, I think I will recommend that my company move to EMR, which offers more flexibility.

Josh Kodroff
ylcnky
  • Auto-scaling is not available in Glue as of now. – Sandeep Fatangare Oct 11 '19 at 04:37
  • Auto-scaling is not available, but you can set the maximum number of DPUs (a ceiling value). Glue calculates the required capacity and uses only that; it can scale up to the configured maximum DPU value. (Charges are only for the capacity used, not for the maximum DPU capacity.) – Anandkumar Oct 28 '20 at 06:49
  • Auto-scaling is available from AWS Glue 3.0, and is currently in preview stage. https://docs.aws.amazon.com/glue/latest/dg/auto-scaling.html – Yuva Jan 04 '22 at 16:00
  • "reading 1-row CSV files, ignoring entire file if there is missing header" - This is something you will have to handle in PySpark – abhijitcaps Aug 22 '22 at 14:53
3

BTW, you can also set the built-in Spark configuration by passing parameters to the Glue job, e.g.:

--conf value: spark.yarn.executor.memoryOverhead=1024   
--conf value: spark.driver.memory=10g  

This can help make the Glue job more flexible.
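Since Glue job arguments are a flat key/value map, the `--conf` key can only appear once. A commonly reported workaround (an assumption on my side, not something the Glue docs promise) is to chain the extra settings inside the value. A minimal sketch, where `glue_spark_conf_argument` is a hypothetical helper name:

```python
def glue_spark_conf_argument(spark_conf):
    """Collapse several Spark settings into a single `--conf` job argument."""
    if not spark_conf:
        raise ValueError("at least one Spark setting is required")
    pairs = [f"{key}={value}" for key, value in spark_conf.items()]
    # The key `--conf` can be used only once, so every setting after the first
    # is appended to the value with its own ` --conf ` prefix.
    return {"--conf": " --conf ".join(pairs)}


if __name__ == "__main__":
    # The two settings from the example above, merged into one argument.
    print(glue_spark_conf_argument({
        "spark.yarn.executor.memoryOverhead": "1024",
        "spark.driver.memory": "10g",
    }))
```

The returned dict can be merged into the job's default arguments or passed as run-time arguments. Whether Glue honours a given setting is worth verifying on a test job, since the service manages some of these values itself.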

taras
esinik
  • The point is that, since AWS Glue is fully managed, the maximum memory limit is 16 GB, so there is a limit on the `spark.driver.memory` value you can set in AWS Glue. In EMR, you can choose the cluster type as per your need, and there is virtually no limit on `spark.driver.memory`. – Sandeep Fatangare Oct 11 '19 at 04:41
3

EMR can act as both an "interactive" and a "batch" data processing framework (EMR is a Hadoop framework). Glue is only a "batch"-mode data processing (ETL) framework (Spark ETL), with the additional capabilities below.

Glue has many capabilities; some of them are:

 1. Glue metadata catalog (Data Catalog - databases and tables)
 2. Glue Crawler - parses the data and creates table definitions
 3. Glue Jobs - ETL
 4. Glue Workflows - combine multiple ETL flows
 5. Glue ML transforms - ML-related transforms
 6. Glue dev endpoints - for developing Glue jobs in notebooks

Glue is a serverless AWS service, which means you don't need to spend time setting up the underlying servers and nodes (even though, behind the scenes, Glue uses EMR). You can still choose the cluster size through the Glue advanced configuration, by picking the worker type (G.1X or G.2X) and the number of DPUs (Data Processing Units); refer to this link: Configuring DPUs

To answer your question with a specific answer:

Glue cannot replace EMR; EMR has more functional capabilities than Glue.

You can think of EMR as a "Hadoop framework with its ecosystem (including Spark)", and Glue as only "Spark ETL with Hive metastore capabilities".

Yes, they can both co-exist. In that case, Glue can act as the ETL framework that sources the data, transforms it, stores it in S3, and maintains the table definitions for that data set in the "Glue Catalog". EMR can then access that data set from S3 using "EMRFS" and the Glue Catalog. Using the EMR ecosystem, you can analyze the data (with its table definitions).

Anandkumar
2

You can actually run regular Spark jobs "serverless" on AWS Glue. We are using AWS Glue as an auto-scale "serverless Spark" solution: jobs automatically get a cluster assigned from the managed AWS Spark cluster pool. The AWS Glue SDK and the Glue Catalog can be ignored and the auto-generated script can be replaced with regular Spark code. Dependencies can be packaged and pushed to S3.

However, the configuration options are limited. Scaling parameters are limited to the WorkerType and NumberOfWorkers, or the magic MaxCapacity. The cluster size does not automatically scale with files opened outside of the Glue SDK.

Example CloudFormation configuration snippet:

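  MyJob:
    Type: "AWS::Glue::Job"
    Properties:
      Command:
        Name: "glueetl"
        # S3 path of the job's main script/class (placeholder)
        ScriptLocation: "SOME_S3_MAIN_CLASS_LOCATION"
      # Number of DPUs for the job; newer templates use MaxCapacity instead
      AllocatedCapacity: 3
      DefaultArguments:
        "--job-language": scala
        # Entry-point class inside the extra JAR
        "--class": some.class.path.inside.jar.MyJob
        "--enable-metrics": true
        # S3 path of the JAR holding the regular Spark code (placeholder)
        "--extra-jars": "SOME_S3_JAR_LOCATION"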

More configuration options can be found in the Glue CloudFormation docs: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-glue-job.html#cfn-glue-job-defaultarguments
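As a rough illustration of the scaling parameters mentioned above: as far as I understand the Glue API, `WorkerType` and `NumberOfWorkers` go together and should not be combined with `MaxCapacity`. The sketch below only builds and validates the capacity part of a job definition; `glue_capacity_settings` is a hypothetical helper, not part of any AWS SDK.

```python
def glue_capacity_settings(worker_type=None, number_of_workers=None, max_capacity=None):
    """Return the capacity-related job properties, rejecting mixed styles."""
    uses_workers = worker_type is not None or number_of_workers is not None
    if uses_workers and max_capacity is not None:
        # Assumption: the two sizing styles are mutually exclusive in the Glue API.
        raise ValueError("set either WorkerType/NumberOfWorkers or MaxCapacity, not both")
    if uses_workers:
        if worker_type is None or number_of_workers is None:
            raise ValueError("WorkerType and NumberOfWorkers must be given together")
        return {"WorkerType": worker_type, "NumberOfWorkers": int(number_of_workers)}
    if max_capacity is None:
        raise ValueError("no capacity setting given")
    # MaxCapacity is expressed in DPUs and may be fractional.
    return {"MaxCapacity": float(max_capacity)}


if __name__ == "__main__":
    print(glue_capacity_settings(worker_type="G.1X", number_of_workers=10))
    print(glue_capacity_settings(max_capacity=3))
```

The returned dict can be merged into the properties of a job definition, for example alongside the `Command` and `DefaultArguments` shown in the snippet above.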

Turiphro