8

I've been reading some DynamoDB index docs and they've left me more confused than anything. Let's clear the air with a concrete example.

I have a simple calendar application, where I have an events table. Here are the columns I have:

id: guid,
name: string,
startTimestamp: integer,
calendarId: guid (foreign key in a traditional RDBMS model)
ownerId: guid (foreign key in a traditional RDBMS model)

I'd like to perform queries such as:

  • Get an event by ID
  • Get all events where calendarId = x and ownerId = y
  • Get all events where startTimestamp is between x and y and calendarId = z

DynamoDB docs seem to heavily suggest avoiding using the event's ID as a partition/sort key here, so what's the recommended schema?

Jim
  • 4,509
  • 16
  • 50
  • 80

2 Answers2

19

This is a problem that everyone wrestles with when they start with (and indeed when they are experienced with) DynamoDB.

Pricing and throughput

Let's start with how DynamoDB is priced (its related - honestly). Ignoring the free tier for a moment, you pay $0.25 per GB per month for data at rest. You also pay $0.47 per Write Capacity Unit (WCU) per month and $0.09 per Read Capacity Unit (RCU) per month. Throughput is the number of WCUs and RCUs on your table. You have to specify throughput up front on your table - the volume of writes and reads you can perform on your table is limited by your throughput provision. Pay more money and you can do more reads and writes per second. The exact details of how DynamoDB partitions tables can be found in this answer.

Keys

Now we need to consider table partitioning. Tables must have a primary key. A primary key must have a hash key (aka a partition key) and may optionally have a sort key (aka a range key). DynamoDB creates partitions based on your hash key values. Within a partition key value the data is sorted by range key, if you have specified one.

Data Access

If you have the exact primary key (hash key and range key if there is one), you can instantly access an item using GetItem. If you have multiple items to get, you can use BatchGetItem.

DynamoDB can only 'search' data in two ways. A Query can only take data from one partition in one call, because it uses the partition key (and optionally a sort key) it is quick. A Scan always evaluates every item in table, so its typically slow and doesn't scale well on large tables.

Throughput distribution

This is where is gets interesting. DynamoDB takes all the throughput you have purchased and evenly spreads it over all of you table partitions. Imagine you have 10 WCUs and 10 RCUs on your table, and 5 partitions, that means you have 2 WCUs and 2 RCUs per partition. That's fine if you access each partition evenly, you get to use all of your purchased throughput. But imagine you only ever access one partition. Now you've purchased 10 WCUs and RCUs but you are only using 2. Your table is going to be much slower than you thought. One option is to just buy more throughput, that will work, but its probably not very satisfactory to most engineers.

Uniform Access v Natural Access

Based on the above we know we want to design a table where each partition gets accessed evenly. However, in my experience people get too hung up about this, which is not surprising if you read the article I just linked (which you also linked).

Remember that partition keys is what we use in a Query to get our data fast, and avoid regular Scans. Some people get too focussed making their partition access perfectly uniform, and end up with a table they can't query quickly.

The answer

I like to refer to Best Practices for Tables guide. And particularly the table where it says User ID is a good partition key so long many user access your application regularly. (It actually says where you have many users - which is not correct, the size of the table is irrelevant).

Its a balance between uniform access and being able to use intuitive, natural queries for your application, but what I am saying is, if you are new to DyanmoDB, the right answer probably is to design your table based on intuitive access. After you've done that successfully, have a think about uniform access and hot partitions, but just remember access doesn't have to be perfectly uniform. There are various design patterns to achieve both intuitive and uniform access, but these can be complicated for those starting out and in many cases can probably discourage people using DynamoDB if they get too focussed on the uniform access idea.

Tips

Most applications will have users. For most queries, in most applications, the most common query you will do is get data for a user. So the first option for most application's primary partition key will often be a user id. That's fine, as long as you don't have a few very high hitting users and many users that never log in.

Another tip. If your table is called vegetables, your primary partition key will probably be vegetable id. If your table is called shoes, your primary partition key will probably be shoe id.

Most applications will have many items for each user (or vegetable or shoe). The primary key has to be unique. A good option often is to add a date range (sort) key - perhaps the datetime the item was created. This then orders the items within the user partition by creation date, and also gives each item a unique composite primary key (i.e. hash key + range key). It's also fine to use a generated UUID as a range key, you wont use the ordering it gives you, but you can then have many items per user and still use the Query function.

Indexes are not a solution

Aha! But I can just make my partition key totally random, then apply an index with a partition key of the attribute I really want to query on. That way I get uniform access AND fast intutive queries.

Sadly not. Indexes have their own throughput and partitioning, separate to the table the index is built on. Just imagine indexes as a whole new table - that's basically what they are. Indexes are not a work around to uneven partition access.

Finally - your schema

Primary Key

Hash Key: Event ID

Range Key: None

Global Secondary index

Hash Key: Calendar ID

Range Key: startTimestamp

Assuming Event ID is uniformly accessed, it would be a great hash key. You would really need to describe how your data is distributed to discuss this much more. Other things that come in to play are how fast you want queries to work and how much you are willing to pay (e.g. secondary indexes are expensive).

And your queries:

Get an event by ID

GetItem using Event ID

Get all events where calendarId = x and ownerId = y

Query by GSI parition key, add a condition on ownerId

Get all events where startTimestamp is between x and y and calendarId = z

Query by GSI parition key, add a condition on range key

F_SO_K
  • 13,640
  • 5
  • 54
  • 83
  • 3
    There are several things listed in your answer that are not entirely correct, or no longer correct: ie. "DyanmoDB creates a partition for each of your hash keys." - this is not true; it's never been true; "DynamoDB takes all the throughput you have purchased and evenly spreads it over all of you table partitions" - sort of true with caveats and especially now with adaptive capacity it's becoming a bit more murky than it used to be – Mike Dinescu Feb 03 '18 at 00:38
  • Thanks - updated the answer. I didn't change the answer based on your second point, but added a link so people can go and read more. – F_SO_K Feb 03 '18 at 09:14
  • @MikeDinescu Source? Where can I read more about that? – Sebastian Nielsen Aug 21 '22 at 12:02
  • @SebastianNielsen see: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-design.html#bp-partition-key-partitions-adaptive – Mike Dinescu Aug 21 '22 at 21:12
0

I just want to add something to the accepted anwser:

Get all events where calendarId = x and ownerId = y

Query by GSI parition key, add a condition on ownerId

This method is not reliable. I guess that when you say "add a condition on ownerId", you mean "add a Filter expression on ownerId" (Definition by Alex DeBrie)

But the 1MB read limit by DynamoDB makes it unreliable.

It is better explained in the link above, but here is the sumup: If you calendar has a lot of events, that represent data with size over 1MB, the results on which you apply the condition ownerId==X will be truncated to the first 1MB, excluding the rest of the data.

AlexandreS
  • 645
  • 1
  • 9
  • 17