After reading the Amazon docs, my understanding is that the only way to run/test a Glue script is to deploy it to a dev endpoint and debug remotely if necessary. At the same time, if the (Python) code consists of multiple files and packages, everything except the main script needs to be zipped. All this gives me the feeling that Glue is not suitable for any complex ETL task, since development and testing are cumbersome. Ideally I would like to test my Spark code locally without having to upload the code to S3 every time, and verify the tests on a CI server without having to pay for a development Glue endpoint.
@lfk - Have you been able to figure this out? I am working with Glue and testing the code on dev endpoints. I am looking for a better alternative. – Deep Apr 19 '18 at 07:34
There didn't seem to be a better alternative. I decided against using Glue in the end. – lfk Apr 19 '18 at 07:43
The Zeppelin workflow mentioned by Yuva still seems to be the way to go as of Aug 2018; it seems unlikely an IDE-based experience will be available any time soon without some sort of publicly available runtime to build/test against locally. If your primary use case for Glue is its sources and sinks, and your actual ETL can be written in Spark, it may be worth building a Spark ETL locally, deploying it as a jar, and leaving your Glue script as a 'dumb' wrapper which just feeds/collects data from the ETL job. – Kyle Sep 07 '18 at 12:00
9 Answers
Eventually, as of Aug 28, 2019, Amazon allows you to download the binaries and
develop, compile, debug, and single-step Glue ETL scripts and complex Spark applications in Scala and Python locally.
Check out this link: https://aws.amazon.com/about-aws/whats-new/2019/08/aws-glue-releases-binaries-of-glue-etl-libraries-for-glue-jobs/
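To give a feel for what local development with those binaries looks like, here is a minimal smoke-test sketch, assuming the aws-glue-libs package and a matching local Spark installation are already on the path (the in-memory table below is made up purely for illustration):

    # Local smoke test: build a DynamicFrame from an in-memory DataFrame,
    # so nothing touches the Data Catalog or S3.
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    dyf = DynamicFrame.fromDF(df, glue_context, "local_test")
    print(dyf.count())  # expect 2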

Yes, but only after disabling Hive support (as per the non-accepted answer here: https://stackoverflow.com/a/45545595/3080611). Then I reran bin/setup.py from the aws-glue repo to build the jars using Maven. – Brian Sep 05 '19 at 06:02
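For context, one standard way to build a local Spark session without Hive support is to force the in-memory catalog; this is an illustrative sketch, not necessarily the exact fix from the linked answer:

    # Build the local SparkSession with the in-memory catalog instead of the
    # Hive metastore, i.e. with Hive support disabled.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("glue-local")
        .config("spark.sql.catalogImplementation", "in-memory")
        .getOrCreate()
    )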
You can keep the Glue and PySpark code in separate files and unit-test the PySpark code locally. For zipping the dependency files, we wrote a shell script that zips the files, uploads them to an S3 location, and then applies a CloudFormation template to deploy the Glue job. For detecting dependencies, we created a (glue job)_dependency.txt file.
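A minimal sketch of that separation, with hypothetical names throughout (transformations.py, add_revenue, a sales table in a mydb catalog database): only the thin wrapper knows about Glue, so the logic module can be unit-tested against a plain local SparkSession.

    # transformations.py -- pure PySpark logic, importable from ordinary unit tests
    from pyspark.sql import DataFrame, functions as F

    def add_revenue(df: DataFrame) -> DataFrame:
        # No GlueContext, no S3: this can run against a local SparkSession.
        return df.withColumn("revenue", F.col("quantity") * F.col("unit_price"))


    # glue_job.py -- thin Glue entry point that only wires sources and sinks
    import sys
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from transformations import add_revenue  # shipped in the zipped dependency package

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())

    source = glue_context.create_dynamic_frame.from_catalog(
        database="mydb", table_name="sales"
    ).toDF()

    glue_context.write_dynamic_frame.from_options(
        frame=DynamicFrame.fromDF(add_revenue(source), glue_context, "result"),
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/output/"},
        format="parquet",
    )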

I spoke to an AWS sales engineer and they said no, you can only test Glue code by running a Glue transform (in the cloud). He mentioned they were testing something called Outpost to allow on-prem operations, but that it wasn't publicly available yet. So this seems like a solid "no", which is a shame because it otherwise seems pretty nice. But without unit tests, it's a no-go for me.

It doesn't seem to be suitable for production, business-critical tasks. I think it's mainly aimed at data scientists running ad-hoc jobs and analytics. Nevertheless, our AWS consultant tried really hard to convince us to use Glue instead of Spark on EMR. – lfk Jan 08 '19 at 23:46
There is now an official Docker image from AWS so that you can execute Glue locally: https://aws.amazon.com/blogs/big-data/building-an-aws-glue-etl-pipeline-locally-without-an-aws-account/
There's a nice step-by-step guide on that page as well.

A thorough guide by AWS: https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/ – selle Sep 29 '22 at 10:43
Not that I know of, and if you have a lot of remote assets, it will be tricky. Using Windows, I normally run a development endpoint and a local Zeppelin notebook while I am authoring my job. I shut it down each day.
You could use the job editor > script editor to edit, save, and run the job. Not sure of the cost difference.

I think the key here is to define what kind of testing you want to do locally. If you are doing unit testing (i.e. testing just one pyspark script independent of the AWS services supporting that script), then sure, you can do that locally. Use a mocking module like pytest-mock, monkeypatch or unittest.mock to mock the AWS and Spark services external to your script while you test the logic you have written in your pyspark script.
For module testing, you could use a notebook environment like AWS EMR Notebooks, Zeppelin or Jupyter. Here you would be able to run your Spark code against test data sources while still mocking the AWS services.
For integration testing (i.e. testing your code integrated with the services it depends on, but not against a production system), you could launch a test instance of your system from your CI/CD pipeline and then have compute resources (like pytest scripts or AWS Lambda) automate the workflow implemented by your script.
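As a rough sketch of the unit-testing approach described above, assuming the job logic is factored into an importable transform(df) function; the module name my_job, the helper get_config_from_ssm, and the column names are all made up for illustration:

    # test_transform.py -- runs with pytest and a local pyspark install, no AWS access
    import pytest
    from pyspark.sql import SparkSession

    from my_job import transform  # hypothetical module holding the pure PySpark logic

    @pytest.fixture(scope="session")
    def spark():
        return (
            SparkSession.builder.master("local[1]")
            .appName("glue-unit-tests")
            .getOrCreate()
        )

    def test_transform_doubles_quantity(spark, monkeypatch):
        # Mock out any AWS lookup the module might make so the test stays local.
        monkeypatch.setattr(
            "my_job.get_config_from_ssm", lambda name: "dummy-value", raising=False
        )

        source = spark.createDataFrame([("a", 2), ("b", 3)], ["key", "quantity"])
        result = transform(source)

        assert "quantity_doubled" in result.columns
        assert result.filter("key = 'a'").first()["quantity_doubled"] == 4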

Adding to CedricB,
For development/testing purposes, it's not necessary to upload the code to S3; you can set up a Zeppelin notebook locally and establish an SSH connection so you have access to the Data Catalog, crawlers, etc., as well as the S3 bucket where your data resides.
After all the testing is completed, you can bundle your code and upload it to an S3 bucket. Then create a job pointing to the ETL script in the S3 bucket, so that the job can be run and scheduled as well. Once all the development/testing is completed, make sure to delete the dev endpoint, as we are charged even for its idle time.
Regards

Worth noting that when Glue compiles your Scala job it may behave a little differently from the Spark shell in a dev endpoint (i.e., at the very least, warnings are treated as fatal, which is not the case in the spark-shell). – Kyle Sep 07 '18 at 12:04
You can do this as follows:
Install PySpark using
>> pip install pyspark==2.4.3
Get the prebuilt AWS Glue 1.0 jar with Python dependencies: Download_Prebuild_Glue_Jar
Copy the awsglue folder and jar file into your PyCharm project from GitHub
Copy the Python code from my git repository
Run the following on your console; make sure to enter your own path:
>> python com/mypackage/pack/glue-spark-pycharm-example.py
From my own blog

Next time, when linking to your own blog, make it very, very clear it is **your** blog. Otherwise you run the risk of it being deleted as spam. – Adriaan Nov 19 '19 at 09:41
If you are looking to run this in Docker, here are the links:
Docker Hub: https://hub.docker.com/r/svajiraya/glue-dev-1.0
Git repo for the Dockerfile:
https://github.com/svajiraya/aws-glue-libs/blob/glue-1.0/Dockerfile

Could you explain how Docker can be used to launch Glue scripts locally? Or maybe point us to some documentation about it? Thanks! – Servadac May 19 '20 at 13:35
Those are unofficial Docker images. There's an official one as well: https://aws.amazon.com/blogs/big-data/building-an-aws-glue-etl-pipeline-locally-without-an-aws-account/ – selle Nov 23 '20 at 17:52