47

I can't find in the formal documentation of AWS Kinesis any explicit reference between TRIM_HORIZON and the checkpoint, and also any reference between LATEST and the checkpoint.

Can you confirm my theory:

  • TRIM_HORIZON - In case the application-name is new, then I will read all the records available in the stream. Else, application-name was already used, then I will read from my last checkpoint.

  • LATEST - In case the application-name is new, then I will read all the records in the stream which added after I subscribed to the stream. Else, application-name was already used, I will read messages from my last checkpoint.

  • The difference between TRIM_HORIZON and LATEST is only in case the application-name is new.

Ida Amit
  • 1,411
  • 2
  • 13
  • 27
  • 2
    both these answers doesn't clearly tell if this matters only during the first time you create even source mapping or you lose data with LATEST in steady state. – iamprem Mar 27 '20 at 20:03
  • 1
    The real question should be how to lose data with kinesis. – John Mercier May 04 '22 at 20:04

4 Answers4

30

AT_TIMESTAMP

-- from specific time stamp

TRIM_HORIZON

-- all the available messages in Kinesis stream from the beginning (same as earliest in Kafka)

LATEST

-- from the latest messages , i.e current message that just came into Kinesis/Kafka and all the incoming messages from that time onwards

Mooncrater
  • 4,146
  • 4
  • 33
  • 62
Suresh
  • 38,717
  • 16
  • 62
  • 66
  • 2
    What does "from the latest messages" mean? Reverse order? Can you expand on this? – Kirk Broadhurst Nov 04 '19 at 13:51
  • 2
    Not reverse order, but skipping messages to start from "now" and moving forward (the idea is the stream is constantly receiving new data, and you can use this as a catch up mechanism at the expense of data loss) – Krease Nov 04 '19 at 14:14
  • 2
    the latest message means the current message that just came into Kinesis. so consumer start consuming the messages from that message and any future messages that come into Kinesis – Suresh Nov 11 '19 at 20:26
  • 1
    @Krease when you say expense of data loss, does it mean only the time when we initially setup the event source mapping or does the data loss happen even after that if you push too much data into dynamo/kinesis and the lambda is slow to process and only pick up latest record by skipping unprocessed records that entered the stream after we setup the event source mapping? – iamprem Mar 27 '20 at 20:01
  • @iamprem - it's just a mechanism to skip records - say your processing is very far behind (ie an hour), you can skip past that hour (losing the records) and start processing more recent ones. This is most useful in scenarios where data recency is more important than data completeness. – Krease Mar 28 '20 at 00:21
18

From GetShardIterator documentation (which lines up with my experience using Kinesis):

In the request, you can specify the shard iterator type AT_TIMESTAMP to read records from an arbitrary point in time, TRIM_HORIZON to cause ShardIterator to point to the last untrimmed record in the shard in the system (the oldest data record in the shard), or LATEST so that you always read the most recent data in the shard.

Basically, the difference is whether you want to start from the oldest record (TRIM_HORIZON), or from "right now" (LATEST - skipping data between latest checkpoint and now).

Krease
  • 15,805
  • 8
  • 54
  • 86
3

The question clearly asks how these options relate to the checkpoint. However, none of the existing answers addresses the checkpoint at all.

An authoritative answer to this question by Justin Pfifer appears in a GitHub issue here.

The most relevant portion is

The KCL will always use the value in the lease table if it's present. It's important to remember that Kinesis itself doesn't track the position of consumers. Tracking is provided by the lease table. Leases in the KCL server double duty. They provide both mutual exclusion, and position tracking. So for mutual exclusion a lease needs to be created, and to satisfy the position tracking an initial value must be selected.

(Emphasis added by me.)

Chuck Batson
  • 2,165
  • 1
  • 17
  • 15
0

I think choosing between either is a trade off between do you want to start from the most recent data or do you want to start from the oldest data that hasnt been processed from kinesis.

Imagine a scenario when there is a bug in your lambda function and it is throwing an exception on the first record it gets and returns an error back to kinesis because of which now none of the records in your kinesis are going to be processed and going to remain there for 1 day period(retention period). Now after you have fixed the bug and deploy your lambda now your lambda will start getting all those messages from the buffer that kinesis has been holding up. Now your downstream service will have to process old data instead of the most recent data. This could add unwanted latency in your application if you choose TRIM_HIROZON.

But if you used LATEST, you can ignore all those previous stuck messages and have your lambda actually start processing from new events/messages and thus improving the latency your system provides.

So you will have to decide which is more important for your customers. Is losing a few data points fine and what is your tolerance limit or you always want accurate results like calculating sum/counter.

Ankur Kothari
  • 822
  • 9
  • 11