What is the best practice for handling unexpected errors in an AWS Step Function using Python Lambdas?

Question

Step Functions are AWS structures that control the flow of lambdas (or other events). All my lambdas use Python (but Lambdas can use most major languages). Throughout the process my step function sends status updates back to the client (the client triggered it via API). Let's say it progresses through these updates: Started -> In Progress -> Finishing -> Done. For handled errors it will send an 'Error' status back to the client. So the client could see a timeline like this: Started -> In Progress -> Errored. This is ideal - so the user knows the process has stopped.

But when there are unexpected/unhandled errors the client never finds out, and the timeline might sit at 'In Progress' indefinitely, so the user doesn't know what happened. So I started looking into the built-in Step Function error handling. I like this option because I can create a 'Catch' function for each lambda or event, where I can communicate back to the client if there is an error. The downside is that it made the step function template/design really messy; see the before/after screenshots below.

BEFORE: (screenshot not available)

AFTER: (screenshot not available)
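
For reference, the per-state 'Catch' described above can be sketched by building the state machine definition as a Python dict, so the repeated block is written once and attached to each Task. This is only a minimal sketch: the state names, the ARNs, and the `with_error_catch` helper are hypothetical, while the `Catch`, `ErrorEquals`, `ResultPath`, and `Next` fields come from the Amazon States Language.

```python
import json


def with_error_catch(state, error_state="NotifyError"):
    """Return a copy of a Task state with a catch-all handler attached."""
    caught = dict(state)
    caught["Catch"] = [
        {
            "ErrorEquals": ["States.ALL"],  # wildcard error name
            "ResultPath": "$.error",        # keep the input, add the error details
            "Next": error_state,
        }
    ]
    return caught


# Hypothetical state names and placeholder ARNs, for illustration only.
states = {
    "Process": with_error_catch({
        "Type": "Task",
        "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:process",
        "Next": "Finish",
    }),
    "Finish": with_error_catch({
        "Type": "Task",
        "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:finish",
        "End": True,
    }),
    "NotifyError": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:notify_error",
        "End": True,
    },
}

definition = {"StartAt": "Process", "States": states}
print(json.dumps(definition, indent=2))
```

Every Task routes to the same `NotifyError` state, so the graph gains one extra node rather than one per lambda, although each Task still draws its own edge to it.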

The template code that generates these graphs doesn't look much better. So I considered an alternative which seems similarly messy. I could add a single try/except block within each lambda for the entire lambda - to catch any/all errors. For example:

def lambda_handler(event, context):
    try:
        # Execute function tasks
        pass
    except Exception:
        # Communicate back to client that there was an error
        pass

Similar to the step function 'Catch' functions, this would ensure that I catch and communicate any error. But this seems like a bad idea just because of what it is (adding a blanket/blind try/except).

So right now I'm stuck between messy/repeated code and try/except-ing everything. Am I implementing step function 'Catch' incorrectly? Am I missing a better way to handle unknown Python errors? Is there another approach entirely?

score 1 · Answer 1 · answered Aug 05 '21 at 20:13

I don't see why having a try-catch system for the entirety of your lambda is such a bad idea. It just ensures that you're always in control of how errors are communicated to the caller of the lambda function.

Imagine, for example, a lambda that serves as a back-end for an HTTP API. It would be better practice to have a try-catch around everything, so you can communicate to your clients what the problem was, or at least return a generic HTTP 500 error. In this case the functions are called by AWS Step Functions, which means your error messages don't have to be user-friendly, but the point that you might want to control how unexpected exceptions are handled still stands in my book.
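
One way to keep that control without repeating the try/except in every function is a small decorator. The following is a minimal sketch, not a definitive implementation: `notify_client` is a hypothetical stand-in for however status updates actually reach the client, and `report_errors` is an illustrative name.

```python
import functools


def notify_client(status, detail):
    """Hypothetical stand-in for however status updates reach the client."""
    print(f"status={status} detail={detail}")


def report_errors(handler):
    """Wrap a Lambda handler so any exception is reported, then re-raised."""
    @functools.wraps(handler)
    def wrapper(event, context):
        try:
            return handler(event, context)
        except Exception as exc:
            notify_client("Error", type(exc).__name__)
            raise  # the invocation still fails, so Step Functions sees the error
    return wrapper


@report_errors
def lambda_handler(event, context):
    # Example task; fails with ZeroDivisionError when divisor is 0.
    return {"status": "In Progress", "value": 10 / event["divisor"]}
```

Re-raising after the notification keeps the failure visible to the state machine, instead of letting the execution continue as if the step had succeeded.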

answered Aug 05 '21 at 20:13

stijndepestel

I like that logic @stijndepestel. My only concern is whether I would still get all the visibility into the error that I need on the backend. Could you provide some code that I could use to print() to the console or send as a message, so that I can resolve the error? I haven't had success in the past at printing out the full stack trace, for example. – Liam Hanninen Aug 05 '21 at 20:19
That hugely depends on the framework/language that you're running in. I don't have enough Python experience to know this by heart, but I presume the Python exception/error object will have some information that you can print to the CloudWatch logs. – stijndepestel Aug 05 '21 at 20:22
This is what I do for my Python Lambdas. Exceptions can be printed to CloudWatch Logs, and there you can see whatever was passed to the exception, along with the traceback, which is especially useful. – Aug 05 '21 at 20:31
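
A minimal sketch of logging the full traceback from a Python Lambda, using only the standard `traceback` and `logging` modules. The `describe_exception` helper and the shape of the returned dict are assumptions made for illustration; in Lambda, log output ends up in CloudWatch Logs.

```python
import logging
import traceback

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def describe_exception(exc):
    """Return the exception's full traceback as one string."""
    return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))


def lambda_handler(event, context):
    try:
        # Example task; raises KeyError when "items" is missing.
        return {"result": event["items"][0]}
    except Exception as exc:
        # Full stack trace goes to the logs, a short summary goes to the caller.
        logger.error("Unhandled error:\n%s", describe_exception(exc))
        return {"status": "Error", "errorType": type(exc).__name__}
```

Inside an `except` block, `logger.exception("Unhandled error")` is a shorter way to log the same traceback; the helper is only needed when the text has to be sent somewhere else as well.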

score 1 · Accepted Answer · edited Aug 05 '21 at 20:44

As @stijndepestel pointed out, having a catch-all error check is a good idea.

What I do in my Python Lambda functions is this: I have a custom router class which, besides managing routes, handles all errors. If the error inherits from a base error class that I've created, then it's a custom error that I raised. Those are assigned special info when I create them, which is automatically formatted when they are converted to strings. The router sends that back to the client if possible.

But if the error is some unknown/unexpected one, then the router prints it with as much detail as possible to CloudWatch Logs, and then returns a generic "500 Internal Server Error" message to the client.
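
The split between known and unknown errors could look roughly like the sketch below. This is not the actual router class, only an illustration under assumptions: `AppError`, `dispatch`, the `route` key, and the response shape are all hypothetical names.

```python
import traceback


class AppError(Exception):
    """Hypothetical base class for errors raised on purpose."""
    status_code = 400

    def __init__(self, message, status_code=None):
        super().__init__(message)
        if status_code is not None:
            self.status_code = status_code


def dispatch(routes, event):
    """Call the handler for event['route'] and turn errors into responses."""
    try:
        handler = routes.get(event.get("route"))
        if handler is None:
            raise AppError("Unknown route", status_code=404)
        return {"statusCode": 200, "body": handler(event)}
    except AppError as err:
        # Expected error: its message is safe to send to the client.
        return {"statusCode": err.status_code, "body": str(err)}
    except Exception:
        # Unexpected error: full details to the logs, generic reply to the client.
        traceback.print_exc()
        return {"statusCode": 500, "body": "Internal Server Error"}
```

Because `AppError` is caught before the general `Exception` clause, only errors that were raised deliberately ever expose their message to the client.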

I'd probably set it up in the future to notify me by email or something like that when such errors occur, so that I can take action quickly.

edited Aug 05 '21 at 20:44

Dharman

answered Aug 05 '21 at 20:39

Very cool @Hcaertnit. How do you pass the stack trace and additional info from the exception? Can you provide some general code, or some parts of your custom router class that handle unknown/unexpected errors? – Liam Hanninen Aug 05 '21 at 20:55
Unfortunately @Liam, there's no good way to get much more info from an unknown exception than its stack trace, which you can get easily [here](https://stackoverflow.com/a/62448533/16442705). Are you thinking of anything in particular? – Aug 05 '21 at 21:11
That should do the trick actually. I'll mark this correct soon if I don't get any AWS-native solutions and/or if I go with this solution anyway. – Liam Hanninen Aug 05 '21 at 21:34
