
I've been tasked with crawling URLs and abruptly stopping the AWS Lambda after 30 seconds; the number of URLs crawled by then will be taken as the evaluation metric. In a simple architecture where a single Lambda takes a file, loops through it, and writes to the database, I could simply have set it to time out after 30 seconds. For the sake of my learning, and to meet the other criterion of scaling, the architecture I adopted is this:

[Image: architecture diagram (not available)]

So even if I time out my Lambdas, they will run again, because the URLs are being sent as events by Fargate. The point of Fargate is to be able to download a huge file, as Lambda has limitations. The use of events then helps me achieve scale by simply allowing more concurrent Lambdas. Can I somehow make the event bus stop sending or receiving notifications after 30 seconds? Can I stop all the compute somehow?

I could send errors (so far I have only seen timeout errors) to a dead-letter queue or an SNS topic and show the system's resilience to abrupt crashes. I could also demonstrate the number of URLs crawled by showing logs. But assuming these measures do not satisfy the evaluator, is there anything I can do?

I can add delays to messages and queues, but how would that help? I can't make a delay start only after a certain time period, which is what would have worked.

  • Potential [XY Problem](https://xyproblem.info/). What requirement are you actually trying to solve? – jarmod Nov 16 '21 at 15:07
  • "PROJECT EVALUATION2 The project will be evaluated on two parameters – scalability and performance. These have been separated into two stages. 1. SUPER SCALER  A list of URLs to be crawled will be provided on the day of evaluation. The crawler will be allowed to execute for 30 seconds. The participant whose crawler crawls the maximum number of URLs will win the Super Scaler" – Muhammad Mubashirullah Durrani Nov 16 '21 at 15:12
  • Are you sure that *you* have to implement the restriction that the "crawler will be allowed to execute for 30 seconds"? If I were running a performance evaluation, I wouldn't allow the vendors to control this. I would control it (somehow). – jarmod Nov 16 '21 at 15:31
  • @jarmod this is a homework assignment, so it would make sense that they're delegating implementing the limitation to students. – gshpychka Nov 16 '21 at 15:41
  • Dear @jarmod, I am a bit confused by your comment. How would you control it? We will show the judges our code and the timeout=core.Duration.seconds(30) parameter of Function. Maybe this really is an XY problem. I'll go with the original simple solution, and if they give me a bigger file that Lambda can't download, I'll simply show them this solution. – Muhammad Mubashirullah Durrani Nov 16 '21 at 15:42
  • Not technically homework. I would not ask if it were the case. – Muhammad Mubashirullah Durrani Nov 16 '21 at 15:42
  • How would I control it? One way would be for the list of test URLs to all be *my* URLs on *my* servers and all the embedded links would be likewise, so that I control the content sent to the test apps. That way I can determine exactly how many HTTP requests came from a particular source over a given period of time, what the aggregate response size was, whether or not a given test client actually requested all of the URLs embedded within a test page, etc. etc. Alternatively, I would ask the testers to persist every page they read onto a file system that I could then query the timestamps of. – jarmod Nov 16 '21 at 16:02
  • We have to use an X-ray on our code. Additionally, the database we make will be queried for results. A word will be given and the resulting links that had that word need to be printed. – Muhammad Mubashirullah Durrani Nov 16 '21 at 16:10
  • You mentioned a concern about not being able to download a large file in Lambda. It's true that Lambda has a limited /tmp space of 512MB but you don't need to download to disk. You could, and likely should, simply store in memory. And, if Lambda is going to be the engine of your solution, be sure to configure large RAM size as this will give you correspondingly more CPU and network bandwidth. You might also consider using a lightweight runtime like Go (goroutines) or Node.js. Also, perhaps split up the list of URLs and fan subsets out to multiple Lambda functions running concurrently. – jarmod Nov 16 '21 at 16:20
  • @jarmod I believe your last point is what they're already doing with the ECS service. – gshpychka Nov 16 '21 at 16:21
  • Love the discussion guys. So many beautiful ways to go about it. From this link: https://github.com/cdk-patterns/serverless/blob/main/the-eventbridge-etl/README.md I chose to use a Fargate container to download the file from s3 rather than using Lambda. For the small bundled test data csv Lambda would have worked but I felt it would be misleading and suggestive that you could pull larger files down onto a Lambda function.... – Muhammad Mubashirullah Durrani Nov 16 '21 at 16:34
  • ... Lambda functions have a few limitations around memory, storage and runtime. You can do things like partially stream files from s3 to Lambda (if they happen to be in the right format) and then store state somewhere between timeouts but I felt that having an ECS Task that you can define CPU, RAM and Disk Space was the much more flexible way to go and being Fargate you are still on the serverless spectrum. You can see how cheap Fargate is if you go into the cost breakdown in Hervé's GitHub repo. – Muhammad Mubashirullah Durrani Nov 16 '21 at 16:34
  • @MuhammadMubashirullahDurrani check out my answer. You cannot stop a running lambda, but you can prevent more from invoking. – gshpychka Nov 16 '21 at 16:49
  • 1
    @gshpycha It's amazing and would have not occurred to me. You've saved me a week's effort. Thank you <3 – Muhammad Mubashirullah Durrani Nov 16 '21 at 16:54

1 Answer


One way would be to disable the EventBridge rule after a set period of time, using the CLI or an SDK from your container:

$ aws events disable-rule --name MyRule --event-bus-name MyEventBus

Another option to stop all further Lambda invocations is to set the function's reserved concurrency to 0, as per How to kill/terminate a running AWS Lambda function?:

$ aws lambda put-function-concurrency --function-name my-function --reserved-concurrent-executions 0

This will not stop the executions that are already running - that cannot be done with Lambda.
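If the container uses the Python SDK rather than the CLI, the two commands above could be combined roughly as follows. This is a minimal sketch, not a tested deployment: the helper name `stop_crawl_after` and its parameters are hypothetical, and the rule, bus, and function names are the placeholders from the commands above. The clients are passed in so the helper can be exercised without AWS credentials.

```python
import time


def stop_crawl_after(deadline_seconds, events_client, lambda_client,
                     rule_name, event_bus_name, function_name,
                     sleep=time.sleep):
    """Wait for the deadline, then block any further Lambda invocations.

    Hypothetical helper: executions already in flight keep running
    until they finish or hit their own timeout.
    """
    sleep(deadline_seconds)

    # Stop the rule from matching and forwarding new URL events.
    events_client.disable_rule(Name=rule_name, EventBusName=event_bus_name)

    # Reserved concurrency of 0 throttles every new invocation.
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=0,
    )


# Example wiring inside the Fargate task (assumes boto3 is installed and
# the task role allows events:DisableRule and lambda:PutFunctionConcurrency):
#
#   import boto3
#   stop_crawl_after(30, boto3.client("events"), boto3.client("lambda"),
#                    "MyRule", "MyEventBus", "my-function")
```

Starting the timer at the moment the container begins publishing URLs is an assumption; the evaluator may define the start of the 30-second window differently.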

Your database can then be filtered by insertion timestamp to exclude all writes that happened after the threshold.
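As an illustration of that filtering step, here is a small sketch that counts only the rows written inside the window. The row shape and the `inserted_at` field name are assumptions, not something from the question; with a real database the same comparison would go into the query's filter condition.

```python
from datetime import datetime, timedelta, timezone


def count_urls_within_window(rows, start, window_seconds=30):
    """Count rows whose insertion time falls within the evaluation window.

    `rows` is an iterable of dicts with a hypothetical `inserted_at`
    datetime field; `start` is when the crawl began.
    """
    cutoff = start + timedelta(seconds=window_seconds)
    return sum(1 for row in rows if start <= row["inserted_at"] <= cutoff)


# Example with made-up data: two writes inside the window, one after it.
start = datetime(2021, 11, 16, 15, 0, 0, tzinfo=timezone.utc)
rows = [
    {"url": "https://example.com/a", "inserted_at": start + timedelta(seconds=5)},
    {"url": "https://example.com/b", "inserted_at": start + timedelta(seconds=29)},
    {"url": "https://example.com/c", "inserted_at": start + timedelta(seconds=31)},
]
print(count_urls_within_window(rows, start))
```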

gshpychka