5

I have a series of ~10 queries that need to run automatically every hour in Redshift (ideally reporting success/failure).

Most of the queries are aggregations over my tables.

I have tried AWS Lambda with CloudWatch Events, but Lambda functions can run for at most 5 minutes, and my queries can take up to 25 minutes.

John Rotenstein
Rahul Gupta

6 Answers

4

I had the same problem in the past; you can use R or Python for that.

I used R: install the RPostgreSQL package and connect to your Redshift cluster. Attached example:

library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
conn <- dbConnect(drv, host='mm-stats-1.ctea4hmr4vlw.us-east-1.redshift.amazonaws.com',
                  port='5439', dbname='stats', user='xxx', password='yyy')

You can then build a report with R Markdown and schedule the script as a crontab task.

I also used the mailR package to send the report to other users.
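For the crontab step, an entry along these lines (script and log paths are hypothetical) would run the report at the top of every hour:

```
# min hour dom mon dow  command — run the R report hourly (hypothetical paths)
0 * * * * Rscript /home/ec2-user/reports/hourly_report.R >> /var/log/hourly_report.log 2>&1
```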

user3600910
3

It's kind of strange that AWS doesn't provide a simple distributed cron-style service; it would be useful for so many things. There is SWF, but the timing/scheduling aspect is left up to the user. You could use Lambda/CloudWatch to trigger SWF events, but that's a lot of overhead to get reasonable cron-like behavior.

As the comment says, the easiest way is to run a small instance and host the cron jobs there. Use an Auto Scaling group of size 1 for some reliability. A similar but more involved approach is to use Elastic Beanstalk.

If you really want redundancy, reliability, visibility, etc., it might be worth looking at a third-party solution like Airflow. There are many others, depending on your language of preference.

Here's a similar question with more info.

systemjack
1

You can use Data Pipeline to do that, although I think it's on an end-of-life path, since AWS hasn't released new features for the service in a while and the GUI is archaic and difficult to work with. The main benefit of Data Pipeline over Lambda is that Lambda functions can only run for a maximum of 15 minutes, whereas Data Pipeline can track the status of the query until it completes.
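One way around the Lambda time limit is to submit the query asynchronously and poll its status instead of holding a connection open. As a rough sketch of that tracking step (my own, not from the answer), here is a polling loop written against the shape of the Redshift Data API's describe_statement response; the describe callable is injected so the loop can be exercised without an AWS connection (in real use you would pass boto3's redshift-data client.describe_statement):

```python
import time

def wait_for_statement(describe, statement_id, poll_seconds=5.0, timeout=1800.0):
    """Poll a Redshift Data API statement until it reaches a terminal state.

    `describe` is a callable shaped like the redshift-data client's
    describe_statement(Id=...) -> {"Status": ...}; it is passed in so the
    loop can be tested without AWS access.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = describe(Id=statement_id)["Status"]
        if status == "FINISHED":
            return True
        if status in ("FAILED", "ABORTED"):
            return False
        time.sleep(poll_seconds)
    raise TimeoutError(f"statement {statement_id} still running after {timeout}s")
```

With real AWS access this would be driven by something like `sid = client.execute_statement(...)["Id"]` followed by `wait_for_statement(client.describe_statement, sid)`.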

stanimal21
1

Good news: Redshift recently added native support for this. You can schedule queries directly from the Redshift console's query editor, which sets up an EventBridge schedule under the hood.
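As a sketch of what that scheduling looks like outside the console (all ARNs, names, and the SQL are placeholders, and the role must be one EventBridge can assume to call the Redshift Data API), an EventBridge rule can target the cluster directly:

```
# placeholders throughout; requires AWS credentials and appropriate IAM roles
aws events put-rule --name hourly-redshift-agg --schedule-expression "rate(1 hour)"
aws events put-targets --rule hourly-redshift-agg --targets '[{
  "Id": "1",
  "Arn": "arn:aws:redshift:us-east-1:123456789012:cluster:stats",
  "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeRedshiftRole",
  "RedshiftDataParameters": { "Database": "stats", "Sql": "CALL hourly_aggregations()" }
}]'
```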

Adam Tokarski
0

I created an AWS Lambda function that connects to the Amazon Redshift database using .NET and PostgreSQL packages. Then, using Event Rules in the CloudWatch dashboard, I created a schedule to execute the Lambda function periodically.

This works fine for me, but the first part, creating a Lambda function that connects to Redshift, takes some time. It is better to use a Layer that includes the required libraries and packages for PostgreSQL connections. (Since Redshift is based on PostgreSQL, it supports those drivers too.)
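Whatever the handler's language, the core of the original task is "run ~10 queries and report success/failure per query". A minimal sketch of that loop (names are my own; the injected execute callable stands in for, e.g., a function wrapping a database cursor's execute, so the logic is testable without a live Redshift connection):

```python
def run_queries(queries, execute):
    """Run each SQL statement in order and record per-query success/failure.

    `execute` is any callable taking one SQL string; a failure in one
    query is reported but does not abort the remaining queries.
    """
    results = {}
    for sql in queries:
        try:
            execute(sql)
            results[sql] = "ok"
        except Exception as exc:
            results[sql] = f"failed: {exc}"
    return results
```

The returned dict is what a Lambda handler could log or send onward (e.g. to SNS) as the success/failure report.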

One last solution is to use another database server that connects to Redshift. For example, I run a SQL Server instance with a linked server connection to the Redshift database, so I can use SQL Server Agent to schedule SQL jobs that run against Redshift.

Eralper
-1

Use AWS Lambda to run your script; you can schedule it. See https://docs.aws.amazon.com/lambda/latest/dg/with-scheduled-events.html

This uses CloudWatch Events behind the scenes. If you set it up from the console, it will configure things for you.

Radu Simionescu