15

Let's say I have a user table with id and timestamp attributes. I would like to be able to query on both parameters. If I understand the documentation correctly, there are two ways of doing this with DynamoDB:

  1. Define a hash+range primary key using id as the hash and timestamp as the range.
  2. Define a hash-only primary key using id and define a global secondary index using timestamp.

What are the benefits and drawbacks of each approach?

David Jones
  • 10,117
  • 28
  • 91
  • 139

3 Answers3

17

Define a hash+range primary key using id as the hash and timestamp as the range.

By making id the Hash Key and timestamp the Range Key, you are effectively creating a 'composite primary key'.

In order words, your DynamoDB schema would allow the following data (notice that 'john' is repeated three times)

id (Hash) | timestamp (Range)
----------|-------------------------
john      | 2014-04-28T07:53:29.000Z
john      | 2014-04-28T08:53:29.000Z
john      | 2014-04-28T09:53:29.000Z
mary      | 2014-04-28T07:53:29.000Z
jane      | 2014-04-28T07:53:29.000Z

And you can perform these operations:

  1. GetItem to get a single item based on the id (Hash Key) + timestamp (Range Key) combination
  2. Query to get a list of all items equal to the id (Hash Key)

If this is not what you intended for, then hash + range on id and timestamp respectively is not what you are looking for.

Define a hash-only primary key using id and define a global secondary index using timestamp.

Using a hash-only primary key on id, id must be unique.

id (Hash) | timestamp (GSI Hash Key)
----------|-------------------------
john      | 2014-04-28T07:53:29.000Z
mary      | 2014-04-28T07:53:29.000Z
jane      | 2014-04-28T07:53:29.000Z

Then by applying GSI hash-only on timestamp, you would be able to query for a list of ids for a particular timestamp.

The benefits to this approach is that, it is definitely the correct solution for your use case. #1 is misuse of range key (unless you are intending to ensure at the application level id is not duplicated which is probably a bad idea).

The drawback to using GSI are:

  1. There can only be a maximum of 5 GSI per table, so choose wisely what you want indexed DynamoDB Update Dec 2019 - You can now create as many as 20 GSI per table, and can further raise this soft limit through a request https://aws.amazon.com/about-aws/whats-new/2018/12/amazon-dynamodb-increases-the-number-of-global-secondary-indexes-and-projected-index-attributes-you-can-create-per-table/
  2. GSI will cost you additional money as you will need to assign Provisioned Throughput to it.
  3. GSI is eventually consistent, meaning that DynamoDB does not guarantee that the moment data associated to the table's hash key is written into DB, the data's GSI hash key immediately becomes available for querying. DynamoDB doc states that this is usually immediate, but can be the case that it could take up to seconds for the GSI hash key to become available.
  4. You cannot perform GetItem on a GSI to obtain an item based on its Hash Key / Hash Key + Range Key. You are limited to use Query which returns a List
Oh Chin Boon
  • 23,028
  • 51
  • 143
  • 215
  • Thanks for the great answer. I'm using RFC 4122 UUIDs for the ID attribute, so your comment about solution #1 being a misuse of a range key is probably fair. However, this drawback is only conceptual, while the drawbacks of using a GSI are quite tangible. This definitely gives me something to think about, though. – David Jones Apr 29 '14 at 19:46
  • I believe you can change GSI on the fly now. – Yuchen Feb 27 '16 at 02:49
  • If you use a simple primary key for the table, can you still query it? I thought that you could only scan tables with simple primary keys (hash without a range). – Jeff Vdovjak Jan 28 '20 at 22:12
5

This answer may be of some use, but you're right about the two ways that you could accomplish it.

Assuming that you use id as your hash key, then in order to retrieve an item using only a timestamp, you'll need a global secondary index. You can still make timestamp your the range key, which will be useful in that DynamoDB will use it to sort the results your queries by id.

The main drawback of using a global secondary index is that you'll need additional provisioned throughput on the table.

Community
  • 1
  • 1
rpmartz
  • 3,759
  • 2
  • 26
  • 36
  • 1
    Do you mean that even if I have hash+range key combination, if i want to query only range key, i still need to define range key as global secondary index? – Emil Dec 17 '16 at 15:27
  • 3
    @batmaci - Yes, any time you want to perform a query in DynamoDB you'll need to specify the hash key, whether that's the table's main hash key or an index's hash key. Think of DynamoDB as a hash table - if you don't have they key, then you have to scan through the whole table. – rpmartz Dec 18 '16 at 15:34
0

I have a similar interest and was considering creating a secondary index on a portion of the timestamp (e.g., day or hour) as the HASH and the Id as the RANGE to allow for a query against a particular time slice, but this would force all records within a time slice to be in the same partition for the index.

For being able to query recent data vs. historical data, Amazon recommends a multi-table design approach - see https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-time-series.html.