
My Current Situation:

I currently have a Python script that fetches data from HTTP endpoints and generates hundreds to thousands of reports daily. It runs on an AWS EC2 instance, where a queue splits the reports to be generated across four threads. Four at a time, the script fetches data, computes each report, and saves it to a PostgreSQL database on Amazon RDS. A rough sketch of the pattern is below.
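For context, a minimal sketch of that worker-queue pattern; the endpoint URL and the compute_report, save_to_rds, and load_daily_report_ids helpers are placeholders for the real script:

    import queue
    import threading

    import requests  # stand-in HTTP client; the real script may use something else

    NUM_WORKERS = 4
    report_queue = queue.Queue()

    def worker():
        while True:
            report_id = report_queue.get()
            try:
                # Fetch source data over HTTP (placeholder endpoint).
                data = requests.get(f"https://api.example.com/reports/{report_id}").json()
                result = compute_report(data)    # placeholder for the calculations
                save_to_rds(report_id, result)   # placeholder for the PostgreSQL insert
            finally:
                report_queue.task_done()

    # Four workers pull report IDs off the queue until the day's work is done.
    for _ in range(NUM_WORKERS):
        threading.Thread(target=worker, daemon=True).start()

    for report_id in load_daily_report_ids():    # placeholder for the daily work list
        report_queue.put(report_id)
    report_queue.join()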

The Problem:

As the project scales, my script won't be able to compute fast enough to generate all the reports it needs within a day using the current method.

Looking For a Solution:

I stumbled across AWS Lambda, but I haven't found anyone using it for a use case similar to mine. My plan would be to upload each report that needs to be generated into its own S3 bucket, then have a Lambda function trigger when the bucket is created. The Lambda function would do all the data fetching (from HTTP endpoints) and all the calculations, and save the result to a row in my PostgreSQL database on Amazon RDS. In theory, this would make everything parallel and eliminate the need for a queue waiting for resources to be freed up. A sketch of what one invocation might look like is below.
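Roughly, I imagine each invocation looking something like this (a sketch only; the endpoint, table name, and compute_report helper are placeholders, and the event structure is the standard S3 ObjectCreated notification):

    import json
    import urllib.request

    import psycopg2  # assumes the PostgreSQL driver is bundled with the deployment

    def handler(event, context):
        # An S3 ObjectCreated notification carries the bucket and key that fired it.
        for record in event["Records"]:
            report_id = record["s3"]["object"]["key"]
            # Fetch the source data over HTTP (placeholder endpoint).
            with urllib.request.urlopen(f"https://api.example.com/reports/{report_id}") as resp:
                data = json.load(resp)
            result = compute_report(data)  # placeholder for the calculations
            # Save one row per report to PostgreSQL on RDS (placeholder credentials).
            conn = psycopg2.connect(host="...", dbname="...", user="...", password="...")
            with conn, conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO reports (report_id, body) VALUES (%s, %s)",
                    (report_id, json.dumps(result)),
                )
            conn.close()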

Basically, I am looking for a solution that makes sure my script is able to run daily and finish each day without running over into the next.

My Questions:

Would AWS Lambda be suitable for something like this?

Would it be costly to do something like this with AWS Lambda (creating hundreds or thousands of S3 buckets a day)?

Are there better options?

Any help, recommendations, insight, or tips are greatly appreciated. Thanks!

Kyle Asaff

3 Answers


Would AWS Lambda be suitable for something like this?

  • You cannot run for longer than 5 minutes.
  • Deployment (especially when you have many external libraries) is a bit clunky.
  • You have little control over how AWS runs your code (there could be delays or pauses, and logs are harder to get at).

If these are not serious concerns for you, I think your problem sounds like a good fit.

Would it be costly to do something like this with AWS Lambda (creating hundreds or thousands of S3 buckets a day)?

See the Lambda Pricing and S3 Pricing.

Creating thousands of buckets per day sounds like a bad idea (and may not be permitted by AWS). By default, you can have 100 buckets in your account, and every bucket name must be globally unique. Maybe you meant thousands of keys within one bucket? A sketch of that approach is below.
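That would look roughly like this (a hedged sketch; the bucket name, key layout, and load_daily_report_ids helper are invented for illustration):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-report-requests"  # hypothetical bucket name, created once

    # Each report request becomes a key; the ObjectCreated event for that key
    # is what triggers the Lambda function, so no new buckets are ever needed.
    for report_id in load_daily_report_ids():  # hypothetical source of work
        s3.put_object(Bucket=BUCKET, Key=f"requests/{report_id}.json", Body=b"{}")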

It all comes down to the size of your reports, the time and memory needed to create them, and the frequency with which they are fetched from AWS (that's when you pay for the data transfer). AWS has a cost calculator, though it's a bit of a pain, so you may prefer to just figure it out yourself from their pricing pages.

Are there better options?

If your reports are generated almost constantly, you are probably better off continuing to run the server yourself. If you get very large batches occasionally, you may be better off bidding on Spot Instances or looking at other cloud service providers. If you get irregular bursts throughout the day, then Lambda seems like an excellent fit for you.

Nathaniel Waisbrot
  • There's also the 1.5 GiB limit on memory, which has to be allocated in advance; the more you allocate, the more it costs, whether you need it or not. I know some of my reports, which generate xlsx files, actually need more memory than this to render. D'oh. +1 for spot instances -- this appears to be an ideal use case. – Michael - sqlbot Mar 30 '16 at 10:58
  • Thanks, this is the insight I was looking for! My reports are generated almost constantly, so I think you are right that I might be better off continuing to run the server myself. I wanted to make sure I fully evaluated Lambda for my use case before spending time and resources on an attempt to switch the project over to Lambda. – Kyle Asaff Mar 30 '16 at 17:44

@Nathaniel has answered most of the questions, but I'd add to the "other options" point:

If you can run more than four reports in parallel from the source's point of view (that is, you limit yourself to four because of CPU utilization, not because the HTTP services can't handle a higher load), then there are definitely more things you can do:

  1. Rewrite your reports to use async I/O so that you can make use of the time spent blocked on HTTP requests. This can increase your throughput (see the sketch after this list).

  2. Get an instance with more CPUs and run your script with more threads. For the kind of task you have, I'd say you can run with at least 4 threads per CPU, maybe more; monitor CPU utilization and increase the number of threads until you achieve good user CPU utilization.

  3. Do self-clustering: have the script run on instance startup and terminate the instance when there is no work left, create a few of those at spot prices, and watch them do the job.

  4. If you don't mind switching languages, you can use explicit clustering and message-based scheduling with something like Akka or Storm.
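For option 1, a minimal sketch of the async I/O approach, assuming the aiohttp library and Python 3.9+ for asyncio.to_thread; the endpoint and the compute_report, save_to_rds, and load_daily_report_ids helpers are placeholders:

    import asyncio

    import aiohttp  # assumed third-party async HTTP client

    MAX_IN_FLIGHT = 50  # tune against what the HTTP services can tolerate

    async def generate_report(session, sem, report_id):
        async with sem:  # cap concurrent HTTP requests
            async with session.get(f"https://api.example.com/reports/{report_id}") as resp:
                data = await resp.json()
        result = compute_report(data)  # placeholder; CPU-bound work still serializes
        # Run the blocking database write off the event loop (placeholder helper).
        await asyncio.to_thread(save_to_rds, report_id, result)

    async def main(report_ids):
        sem = asyncio.Semaphore(MAX_IN_FLIGHT)
        async with aiohttp.ClientSession() as session:
            await asyncio.gather(*(generate_report(session, sem, rid) for rid in report_ids))

    asyncio.run(main(load_daily_report_ids()))  # placeholder source of work

The win here comes from overlapping many slow HTTP round trips; if the calculations themselves dominate, options 2 and 3 will help more.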

denismo
  • 3. I do this with persistent spot requests. They go away if the price rises above the threshold and come back when it drops again. Finding the right instance class in the right zones in the right region (spot prices can vary widely across availability zones in the same region for the exact same instance class) can sometimes be such a bargain that even running almost all the time gets you much more power for less money than running all the time would; perfect for jobs that are not time-critical. – Michael - sqlbot Mar 30 '16 at 11:08

The AWS Big Data blog introduced an architecture for parallel processing of large numbers of files in S3 with Lambda. The prototype-level implementation is in Node.js, but the architecture is not language-dependent. This solution assumes you can distribute the processing.

To summarise:

  1. The idea is to run one EC2 machine that takes in report requests. This machine creates a list of source files in S3 (HTTP endpoints, perhaps, in your case) which it distributes to a first level of Lambda functions (see the sketch after this list).
  2. Each of these functions distributes tasks to leaf-level worker functions.
  3. All the results are aggregated back on the EC2 machine, which streams them to the user in real time (to RDS, perhaps, in your case).
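A minimal sketch of that fan-out step from the coordinator, assuming boto3 and a hypothetical first-level function name:

    import json

    import boto3

    lam = boto3.client("lambda")

    def fan_out(report_ids, batch_size=100):
        # Split the day's reports into batches and invoke one first-level
        # Lambda per batch; each of those can fan out again to leaf workers.
        for i in range(0, len(report_ids), batch_size):
            batch = report_ids[i:i + batch_size]
            lam.invoke(
                FunctionName="report-mapper",  # hypothetical first-level function
                InvocationType="Event",        # asynchronous: don't wait for results
                Payload=json.dumps({"report_ids": batch}),
            )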

Your use case is different, but the article shows a simple way of running enormous analysis tasks in parallel in a very short time.

The prototype implementation presented is missing several obvious features needed in production and thus can only be used as a demo. Also take a look at the author's excellent re:Invent presentation linked in the comments.

h-kippo