We have ETL jobs in which a Java jar (which performs the ETL operations) is run via a shell script. The shell script is passed parameters specific to the job being run. These shell scripts are run via crontab as well as manually, depending on requirements. Sometimes we also need to run SQL commands or scripts on the PostgreSQL RDS database before the shell script runs.

Everything is on AWS: an EC2 Talend server, PostgreSQL RDS, Redshift, Ansible, etc. How can we automate this process? How should we deploy it and handle passing custom parameters? Pointers are welcome.

ExploringApple
  • I always use Airflow for complex schedules or just a simple ec2 server with a cron job setup for simple situations – Jon Scott Jun 28 '18 at 14:04
  • It's not just about the scheduling. We have 3-4 ETL developers, and the operations team has to schedule 5-8 jobs daily. I am looking for a platform to reduce the heavy lifting. – ExploringApple Jun 28 '18 at 15:33

2 Answers

I would prefer to go with AWS Data Pipeline and add steps to perform any pre/post operations around your ETL job, such as running shell scripts or HQL.
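As a rough illustration of the pre-step idea, the sketch below builds a Data Pipeline definition fragment in which a `SqlActivity` runs before a `ShellCommandActivity` that launches the existing shell script with its job parameters. The object ids, the script path, the S3 location of the SQL file, and the parameter values are all hypothetical, and the referenced `Ec2Resource` and `RdsDatabase` objects (plus any schedule) are assumed to be defined elsewhere in the same definition.

```python
import json
import shlex


def build_pipeline_definition(script_path, job_params, pre_sql_uri):
    """Return a Data Pipeline definition fragment as a plain dict.

    Assumptions (not from the original post): the ids used here, and that
    "EtlEc2" (an Ec2Resource) and "EtlRds" (an RdsDatabase) are defined
    elsewhere in the full pipeline definition.
    """
    # Quote every token so parameter values containing spaces survive the shell.
    tokens = [script_path] + list(job_params)
    command = " ".join(shlex.quote(token) for token in tokens)

    pre_sql = {
        "id": "PreSql",
        "name": "PreSql",
        "type": "SqlActivity",
        "database": {"ref": "EtlRds"},
        "scriptUri": pre_sql_uri,
        "runsOn": {"ref": "EtlEc2"},
    }
    run_etl = {
        "id": "RunEtl",
        "name": "RunEtl",
        "type": "ShellCommandActivity",
        "command": command,
        # The shell step starts only after the SQL step has succeeded.
        "dependsOn": {"ref": "PreSql"},
        "runsOn": {"ref": "EtlEc2"},
    }
    return {"objects": [pre_sql, run_etl]}


if __name__ == "__main__":
    definition = build_pipeline_definition(
        "/opt/etl/run_job.sh",            # hypothetical script path
        ["sales", "daily load"],          # hypothetical job parameters
        "s3://example-bucket/pre.sql",    # hypothetical SQL script location
    )
    print(json.dumps(definition, indent=2))
```

The printed JSON would still need the resource, database, and schedule objects added before it could be uploaded as a complete pipeline definition.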

AWS Glue runs on the Spark engine and has other features as well, such as the AWS Glue development endpoint, crawlers, the Data Catalog, and job schedulers. I think AWS Glue would be ideal if you are starting afresh or plan to move your ETL to AWS Glue. Please refer here for a price comparison.

AWS Data Pipeline: For details on AWS Data Pipeline

AWS Glue FAQ: For details on supported languages for AWS Glue

Please note according to AWS Glue FAQ:

Q: What programming language can I use to write my ETL code for AWS Glue?

You can use either Scala or Python.

Edit: As Jon Scott commented, Apache Airflow is another option for job scheduling, but I have not used it.

Joe Harris
Yuva

You can use AWS Glue to perform serverless ETL. Glue also has triggers, which let you automate your jobs.
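A minimal sketch of a scheduled trigger, assuming the ETL logic has already been rewritten as a Glue job (Glue would not run the existing jar as-is). It builds the request for the boto3 Glue client's `create_trigger` call; the trigger name, job name, cron expression, and job argument are hypothetical, and the call itself needs AWS credentials and an existing Glue job.

```python
def build_trigger_request(trigger_name, job_name, cron_expr, job_args):
    """Build keyword arguments for the Glue create_trigger API call.

    All names and values passed in are hypothetical examples.
    """
    # Glue job arguments are conventionally passed with a leading "--".
    arguments = {f"--{key}": str(value) for key, value in job_args.items()}
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_expr})",
        "Actions": [{"JobName": job_name, "Arguments": arguments}],
        "StartOnCreation": True,
    }


def create_trigger(request):
    """Send the request to AWS Glue (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    return boto3.client("glue").create_trigger(**request)


if __name__ == "__main__":
    request = build_trigger_request(
        "nightly-sales-trigger",      # hypothetical trigger name
        "sales-etl",                  # hypothetical Glue job name
        "0 2 * * ? *",                # every day at 02:00 UTC
        {"run_date": "2018-06-28"},   # hypothetical custom job parameter
    )
    print(request)
```

Keeping the request builder separate from the API call makes the custom parameters easy to inspect before anything is sent to AWS.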

Kishore Bharathy
  • Have you used it? How is the functionality in the jar file used in Glue? I saw this but don't have a clear idea of how to use our current functionality with Glue. – ExploringApple Jun 28 '18 at 15:35
  • Glue can do the ETL functions mentioned in its documentation. You'll have to write your ETL code in Python or Scala. – Kishore Bharathy Jun 28 '18 at 15:58
  • And only native Python; the pandas and NumPy libraries are not supported CURRENTLY. It is on AWS's to-do list, but there is no ETA as of now. – Yuva Jun 28 '18 at 16:31
  • It also depends on the language used in your existing ETL jobs: if it is native Python or Scala, yes; otherwise it would not work. If it's a simple transformation and move to the target, Glue can help you by generating code, which can be customized as well. As I said, it depends on the complexity and nature of your existing code, and on the time you have for modifying the code, if that is allowed. – Yuva Jun 28 '18 at 17:04