
I have a data normalization process that exists in Python but now needs to scale. This process currently runs via a job-specific configuration file containing a list of transforming functions that need to be applied to a table of data for that job. The transforming functions are mutually exclusive and can be applied in any order. All transforming functions live in a library and only get imported and applied to the data when they are listed in the job-specific configuration file. Different jobs will have different required functions listed in their configuration, but all functions will exist in the library.
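For context, the current setup looks roughly like this (a minimal sketch; the module and function names are made up):

```python
# Minimal sketch of the current (pre-Glue) process described above.
# "transform_library" and the transform names are hypothetical.
import importlib
import json

import pandas as pd


def run_job(config_path: str, df: pd.DataFrame) -> pd.DataFrame:
    """Apply the transforms listed in a job-specific config to a table."""
    with open(config_path) as f:
        # e.g. {"transforms": ["strip_whitespace", "normalize_dates"]}
        config = json.load(f)

    library = importlib.import_module("transform_library")  # shared library of transforms
    for name in config["transforms"]:
        transform = getattr(library, name)  # look up the function by name
        df = transform(df)                  # transforms are independent, so order doesn't matter
    return df
```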

In the most general sense, how might a process like this be handled by AWS Glue? I don't need a technical example as much as a high-level overview. Simply looking to be aware of some options. Thanks!

KidMcC

1 Answer


The single most important thing you need to consider when using AWS Glue is that it is a serverless, Spark-based environment with extensions. That means you will need to adapt your script to be PySpark-like. If you are OK with that, then you can use external Python libraries by following the instructions in the AWS Glue Documentation.
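As a rough, untested sketch of what that could look like (the config location, catalog names, and `transform_library` module below are placeholders, not anything specific to your setup), the same config-driven dispatch can live inside a Glue PySpark script:

```python
# Rough sketch of mapping the config-driven pattern onto a Glue PySpark job.
# Bucket/key, database/table names, and transform_library are placeholders.
import importlib
import json
import sys

import boto3
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Pass the job-specific config location as a job argument, e.g. --config_s3_path s3://...
args = getResolvedOptions(sys.argv, ["JOB_NAME", "config_s3_path"])

# Fetch the job-specific config (a JSON list of transform names) from S3
bucket, key = args["config_s3_path"].replace("s3://", "").split("/", 1)
body = boto3.resource("s3").Object(bucket, key).get()["Body"].read()
config = json.loads(body.decode("utf-8"))

glue_context = GlueContext(SparkContext.getOrCreate())

# Load the source table from the Glue Data Catalog and work on it as a Spark DataFrame
df = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"
).toDF()

# Apply only the transforms named in this job's config; each takes and returns a DataFrame
library = importlib.import_module("transform_library")  # shipped as an extra Python file or .egg
for name in config["transforms"]:
    df = getattr(library, name)(df)

# ... then write df back out, e.g. via glue_context.write_dynamic_frame or df.write ...
```

The point being that Glue doesn't prevent you from keeping the job-specific configuration file pattern; the config just has to live somewhere the job can read it, such as S3 or the job parameters.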

If you already have your scripts running and you don't feel like using Spark, you can always consider AWS Data Pipeline. It's a service for running data transforms in more ways than just Spark. On the downside, AWS Data Pipeline is task-driven rather than data-driven, which means no catalog or schema management.

How to use AWS Data Pipeline with Python is not obvious from the documentation, but the process is basically: stage a shell file in S3 with the instructions to set up your Python environment and invoke the script, then configure a schedule for the pipeline, and AWS will take care of starting virtual machines whenever needed and stopping them afterwards. There is a good post about this on Stack Overflow.

Javier Ramirez
  • Thanks, Javier! Adapting the script to be PySpark-like is actually no issue at all. Also, there are some constraints that are going to force the choice of AWS Glue over the Data Pipeline option you mentioned. So long as a job-specific configuration file is workable in this framework, everything else should fall into place, it seems. – KidMcC Feb 27 '19 at 17:47
  • Hey @queb182, I just noticed we recently announced that you can now run Python shell jobs that don't need to be based on PySpark. You can also package extra dependencies if needed. I still haven't used this feature myself, but I thought it might be helpful for your use case. You can read more at [the AWS Glue docs](https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html) – Javier Ramirez Mar 01 '19 at 12:08
  • Hi Javier, FYI, please note that the Python shell in Glue supports a maximum of one DPU only, so I don't think we can get the benefit of Spark's parallelism. If you are not worried about performance but need to run your existing Python script while using the other benefits of Glue such as catalogs, jobs, scheduling, etc., you should be good to go. Also, if you need any Python libraries beyond what is available in the Python shell, you may need to create an .egg file for them. You probably know this already :), just thought I would add to your inputs. Correct me if I am wrong. – Yuva Mar 01 '19 at 15:48