Amazon Kinesis & AWS Lambda Retries

Question

I'm very new to Amazon Kinesis so maybe this is just a problem in my understanding but in the AWS Lambda FAQ it says:

The Amazon Kinesis and DynamoDB Streams records sent to your AWS Lambda function are strictly serialized, per shard. This means that if you put two records in the same shard, Lambda guarantees that your Lambda function will be successfully invoked with the first record before it is invoked with the second record. If the invocation for one record times out, is throttled, or encounters any other error, Lambda will retry until it succeeds (or the record reaches its 24-hour expiration) before moving on to the next record. The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel.

My question is, what happens if for some reason some malformed data gets put onto a shard by a producer and when the Lambda function picks it up it errors out and then just keeps retrying constantly? This then means that the processing of that particular shard would be blocked for 24 hours by the error.

Is the best practice to handle application errors like that by wrapping the problem in a custom error and sending this error downstream along with all the successfully processed records and let the consumer handle it? Of course, this still wouldn't help in the case of an unrecoverable error that crashed the program like a null pointer: again we'd be back to the blocking retry loop for the next 24 hours.

score 40 · Accepted Answer · answered Sep 11 '15 at 06:33

Don't overthink it: Kinesis is essentially a queue. You have to consume a record (i.e. pop it from the queue) successfully before you can proceed to the next one, just like any FIFO queue.

The appropriate approach should be:

Get a record from the stream.
Process it in a try-catch-finally block.
If the record is processed successfully, no problem. <- TRY
But if it fails, record it somewhere else so you can investigate why it failed. <- CATCH
And at the end of your logic blocks, always persist the position to DynamoDB. <- FINALLY
If an internal error occurs in your system (memory error, hardware error, etc.), that is another story, as it may affect the processing of all records, not just one.
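
The steps above can be sketched roughly as follows. This is only an illustration under assumptions, not the answerer's actual code: `processRecord`, `failedRecords`, and `saveCheckpoint` are hypothetical names, and the in-memory variables stand in for a real failure log and a DynamoDB checkpoint table.

```javascript
// Hypothetical in-memory stand-ins for a failure log and a checkpoint store.
const failedRecords = [];
let lastCheckpoint = null;

// Stand-in for persisting the position (sequence number) to DynamoDB.
function saveCheckpoint(sequenceNumber) {
    lastCheckpoint = sequenceNumber;
}

// Hypothetical business logic: throws on malformed JSON.
function processRecord(record) {
    return JSON.parse(record.data);
}

function consume(records) {
    const results = [];
    for (const record of records) {
        try {
            // TRY: normal processing
            results.push(processRecord(record));
        } catch (err) {
            // CATCH: note the failure elsewhere and keep going
            failedRecords.push({
                sequenceNumber: record.sequenceNumber,
                reason: err.message,
            });
        } finally {
            // FINALLY: always advance the position, even past a bad record
            saveCheckpoint(record.sequenceNumber);
        }
    }
    return results;
}
```

Because the checkpoint advances in the `finally` branch, a malformed record is logged once and skipped instead of blocking the shard.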

By the way, if processing a single record takes more than 1 minute, you are almost certainly doing something wrong. Kinesis is designed to handle thousands of records per second, so you cannot afford such long-running jobs for each record.

The question you are asking is a general problem of queue systems, sometimes called a "poison message". You have to handle such messages in your business logic to be safe.

http://www.cogin.com/articles/SurvivingPoisonMessages.php#PoisonMessages

az3

Sounds reasonable, but a quick question about the DynamoDB bit: why do I need to persist the position (I presume you mean the sequence number)? – Stefano Sep 11 '15 at 16:05
Because when you stop a "Kinesis Consumer Application" node and start it again later, you should be able to continue from where you left off. – az3 Sep 15 '15 at 07:34
Ah yes, that makes sense. – Stefano Sep 15 '15 at 16:31
Both answers are good and say similar things but I'm going to give the answer to @az3 because he answered first. – Stefano Sep 28 '15 at 18:34
In worker.java, it calls runProcessLoop, which calls shardConsumer.consumeShard(), which in turn calls checkAndSubmitNextTask(). That method checks readyForNextTask, and if it is not ready, it does not consume new records. So how is it possible for the worker to retrieve new records before the record processor has processed the previous ones? – user1846749 Jan 24 '17 at 00:05

Guy · Answer 2 · 2015-09-14T16:00:30.793

This is a common question about processing events in Kinesis, and I'll try to give you some pointers on building your Lambda function to handle such issues with "corrupted" data. Since it is best practice to have separate parts of your system writing to the Kinesis stream and reading from it, such problems are common.

First, why do you have such problematic events?

Using Kinesis to process your events is a good way to break up a complex system that does both front-end processing (serving end users) and, at the same time and in the same code, back-end processing (analyzing events) into two independent parts. The front-end people can focus on their business, while the back-end people don't need to push code changes to the front-end when they want to add functionality for their analytic use cases. Kinesis is a buffer of events that both removes the need for synchronization and simplifies the business logic code.

Therefore, we would like events written to the stream to be flexible in their "schema", and if the front-end teams wish to change the event format, add fields, delete fields, change the protocol or the encryption keys, they should be able to do that as often as they want.

Now it is up to the teams reading from the stream to process such flexible events efficiently, without breaking their processing every time such a change happens. You should therefore expect your Lambda function to see events that it can't process, and a "poison pill" is not as rare an event as you might expect.

Second, how do you handle such problematic events?

Your Lambda function will get a batch of events to process. Please note that you shouldn't get the events one by one, but in large batches of events. If your batches are too small, you will quickly get large lags on the stream.

For each batch you iterate over the events, process them, and then checkpoint the last sequence ID of the batch in DynamoDB. Lambda does most of these steps automatically, as in the following example (see more here: http://docs.aws.amazon.com/lambda/latest/dg/walkthrough-kinesis-events-adminuser-create-test-function.html):

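console.log('Loading function');

exports.handler = function(event, context, callback) {
    console.log(JSON.stringify(event, null, 2));
    event.Records.forEach(function(record) {
        // Kinesis data is base64 encoded, so decode it here.
        // Buffer.from replaces the deprecated `new Buffer(...)` constructor.
        const payload = Buffer.from(record.kinesis.data, 'base64').toString('ascii');
        console.log('Decoded payload:', payload);
    });
    // Signal success so that Lambda moves on to the next batch.
    callback(null, 'Processed ' + event.Records.length + ' records');
};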

This is what happens on the "happy path", when all the events are processed without any problem. But if you encounter a problem in the batch and don't "commit" the events with the success notification, the batch fails and you will receive all of its events again.

Now you need to determine the reason for the processing failure.

Temporary problem (throttling, network issue...) - it is OK to wait a second and try again for a couple of times. In many cases the issue will resolve itself.
Occasional problem (out of memory...) - it is best to increase the memory allocation of the Lambda function or decrease the batch size. In many cases such modification will resolve the issue.
Constant failure - it means that you have to either ignore the problematic event (put it in a DLQ - dead-letter-queue) or modify your code to handle it.

The challenge is to identify the type of failure in your code and handle each type differently. Write your Lambda code so that it can tell the types apart (by exception type, for example) and react accordingly.
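
One way to express that decision is a small function that maps an error type to an action. This is a minimal sketch under assumptions: `TransientError`, `PermanentError`, and `decideAction` are hypothetical names, and real code would first have to map SDK or library errors onto these categories.

```javascript
// Hypothetical error types for the categories described above.
class TransientError extends Error {}  // throttling, network issue...
class PermanentError extends Error {}  // malformed or unprocessable event

// Decide how to react to a failure on a given attempt (1-based).
function decideAction(err, attempt, maxAttempts = 3) {
    if (err instanceof TransientError) {
        // Worth waiting and trying again a couple of times.
        return attempt < maxAttempts ? 'retry' : 'fail-batch';
    }
    if (err instanceof PermanentError) {
        // Retrying cannot help: set the event aside and move on.
        return 'dead-letter';
    }
    // Unknown errors: fail the batch so that Lambda retries it.
    return 'fail-batch';
}
```

A handler could call `decideAction` inside its catch block and then sleep and retry, write the event to its dead-letter location, or rethrow, depending on the result.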

You can use the CloudWatch integration to write such failures to the logs and create the relevant alarms. You can also use CloudWatch Logs as a record of your "dead-letter queue" and see what the source of the problem is.

How do you handle the case where *some* of the events in a batch succeeded but others failed? Consider a Lambda that sends an email using SES for each event it receives. I might get a batch of 100 events and send the first 20 emails correctly, but then SES has an outage for the rest of the time. I want to report success for the first 20 events (so that I don't spam people), but I want to retry the remaining 80. Is that possible? – Cam Jackson Feb 03 '17 at 00:27
You can manage a list with lookup functionality to avoid duplicates. You can use a DynamoDB table with the email address as the key and the time of the last email sent as the value. Another common solution is to use Redis in ElastiCache with a TTL on the email keys. Before you send an email, you check when the last email was sent to that recipient, and you update the record on every successful send. – Guy Feb 05 '17 at 21:57
I'm facing the same scenario @CamJackson. DynamoDB now supports TTL, which could be useful for this. – Ezequiel Moreno Mar 29 '17 at 04:37
Where the order of messages is not important, would re-inserting the 80 failed events into the same stream (to retry again very shortly) or into a new `retry_5_minutes_later` stream work? – aalimovs May 18 '17 at 20:42
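
The lookup-based de-duplication suggested in the comments above could look roughly like this. It is a sketch under assumptions: the `Map` is an in-memory stand-in for the DynamoDB table or Redis cache, and `sendOnce` and the `sendEmail` callback are hypothetical names.

```javascript
// Hypothetical in-memory stand-in for a DynamoDB table or Redis cache,
// keyed by recipient, holding the time of the last successful send.
const lastSent = new Map();

// Returns true if the email was sent, false if it was skipped as a duplicate.
// `sendEmail` is supplied by the caller (for example, a wrapper around SES).
function sendOnce(recipient, sendEmail, ttlMs, now = Date.now()) {
    const previous = lastSent.get(recipient);
    if (previous !== undefined && now - previous < ttlMs) {
        // Sent recently: a retried batch must not email this recipient again.
        return false;
    }
    sendEmail(recipient);         // throws on failure, so nothing is recorded
    lastSent.set(recipient, now); // record only successful sends
    return true;
}
```

With such a guard in place, the handler can fail the whole batch when SES is down: on retry, the first 20 recipients are skipped and only the remaining 80 are attempted.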

score 0 · Answer 3 · answered Jan 05 '23 at 13:29

In your Lambda you can either throw an error, which returns the whole batch for another attempt, or not throw and instead push the failed messages to an SQS queue so that they can be handled separately. SQS has a retention period of up to 14 days. You could also keep a checkpoint for each record so that you know whether the record was processed in a previous run.

If you have a lot of incoming data and you don't want to introduce any latency, you could just ignore the error and move on while adding those events to an SQS queue.
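
A rough sketch of the "don't throw, forward to a queue" option follows. The names `partitionBatch`, `makeHandler`, and `sendToQueue` are hypothetical; `sendToQueue` stands for whatever async function would wrap the SQS send call, and it is injected so that the sketch needs no SDK.

```javascript
// Decode and process every record without throwing on bad data.
// Returns the failures so that the caller can forward them to a queue.
function partitionBatch(records, processRecord) {
    const failed = [];
    let processed = 0;
    for (const record of records) {
        try {
            const payload = Buffer.from(record.kinesis.data, 'base64').toString('utf8');
            processRecord(payload);
            processed++;
        } catch (err) {
            failed.push({
                sequenceNumber: record.kinesis.sequenceNumber,
                data: record.kinesis.data, // keep the original base64 payload
                error: err.message,
            });
        }
    }
    return { processed, failed };
}

// Build a Lambda handler around the hypothetical `sendToQueue` function.
function makeHandler(processRecord, sendToQueue) {
    return async function handler(event) {
        const { processed, failed } = partitionBatch(event.Records, processRecord);
        if (failed.length > 0) {
            // If this call throws, the invocation fails and the batch is retried.
            await sendToQueue(failed);
        }
        return { processed, deadLettered: failed.length };
    };
}
```

The handler only fails when the queue itself is unreachable, so a single malformed record no longer blocks the shard.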
