171

I'm coming from a relational database background and trying to work with Amazon's DynamoDB.

I have a table with a hash key "DataID" and a range key "CreatedAt", and a bunch of items in it.

I'm trying to get all the items that were created after a specific date, sorted by date, which is straightforward in a relational database.

In DynamoDB the closest thing I could find is a query using a greater-than condition on the range key. The only issue is that to perform a query I need a hash key, which defeats the purpose.

So what am I doing wrong? Is my table schema wrong? Shouldn't the hash key be unique? Or is there another way to query?

MatthewMartin
applechief

8 Answers

91

Given your current table structure, this is not possible in DynamoDB. The key challenge is to understand that the hash key of the table (the partition key) should be treated as creating separate tables. In some ways this is really powerful (think of partition keys as creating a new table for each user or customer, etc.).

Queries can only be done in a single partition. That's really the end of the story. This means that if you want to query by date (you'll want to use msec since epoch), then all the items you want to retrieve in a single query must have the same hash (partition) key.

I should qualify this. You absolutely can scan by the criterion you are looking for; that's no problem, but it means you will be looking at every single row in your table and then checking whether that row has a date matching your parameters. This is really expensive, especially if you are in the business of storing events by date in the first place (i.e. you have a lot of rows).

You may be tempted to put all the data in a single partition to solve the problem, and you absolutely can, but your throughput will be painfully low, given that each partition receives only a fraction of the table's total provisioned throughput.

The best thing to do is determine more useful partitions to create to save the data:

  • Do you really need to look at all the rows, or is it only the rows by a specific user?

  • Would it be okay to first narrow down the list by Month, and do multiple queries (one for each month)? Or by Year?

  • If you are doing time-series analysis there are a couple of options: change the partition key to something computed on PUT to make the query easier, or use another AWS product like Kinesis, which lends itself to append-only logging.
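The "multiple queries, one for each month" idea above can be sketched as a helper that lists the month partitions covering a date range; each returned key would then back one Query call, with the range key filtered inside that partition. The helper name and "YYYY-MM" key format are illustrative, not from the answer:

```javascript
// List the "YYYY-MM" partition keys covering [startMs, endMs]
// (msec since epoch). One Query per key retrieves the full range.
function monthPartitions(startMs, endMs) {
  const keys = [];
  const start = new Date(startMs);
  const end = new Date(endMs);
  let year = start.getUTCFullYear();
  let month = start.getUTCMonth(); // 0-based
  while (year < end.getUTCFullYear() ||
         (year === end.getUTCFullYear() && month <= end.getUTCMonth())) {
    keys.push(`${year}-${String(month + 1).padStart(2, "0")}`);
    month += 1;
    if (month === 12) { month = 0; year += 1; }
  }
  return keys;
}
```

For example, a range from mid-November 2021 to early February 2022 yields four partition keys, so four queries whose results are concatenated client-side.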

Warren Parad
  • I want to emphasize the option you put forth in your last paragraph about considering "by year". Create an attribute like `yyyy` and hash on that, but also create a `created` date which you can use as your range key. Then you get 10GB of data per year (27 MB per day), which is probably fine for most circumstances. It does mean that you have to issue one query per year when a date query crosses a year boundary, but at least it'll work, and it's safer than creating a dummy hash key. – Ryan Shillington May 05 '17 at 15:50
  • Another option: http://stackoverflow.com/questions/35963243/how-to-query-dynamodb-by-date-range-key-with-no-obvious-hash-key?noredirect=1&lq=1 – Ryan Shillington May 05 '17 at 16:32
  • As the link above explains, strictly time-based partition keys can lead to hot spots. If you must use time-based partition keys, it's better to add some other element to the partition key to spread a time period over multiple partitions. I have seen suggestions of just using a prefix between 0 and n, where n is the number of partitions each time bucket should be spread over. – dres May 07 '18 at 15:27
  • @RyanShillington There is no 10GB limit on *global* secondary indexes. That limit only applies to *local* secondary indexes. – Simon Forsberg Feb 18 '20 at 13:07
  • _"each partition only receives a fraction of the total set amount"_ -- This is no longer true due to [adaptive capacity](https://aws.amazon.com/blogs/database/how-amazon-dynamodb-adaptive-capacity-accommodates-uneven-data-access-patterns-or-why-what-you-know-about-dynamodb-might-be-outdated/). It seems to me that you could have a *dummy attribute* in the table that always has the same value. Then have a global secondary index using the dummy attribute as partition key and `CreatedAt` as the sort key. Then you can query by date across all items. Seems hacky, but *is there a better way?* – Bennett McElwee Nov 10 '20 at 02:08
  • @BennettMcElwee I went through the link you shared; it mentions that adaptive capacity works fine (no hot-partition issues) as long as traffic does not exceed provisioned capacity: "DynamoDB adaptive capacity enables your application to continue reading and writing to hot partitions without being throttled, provided that traffic does not exceed your table's total provisioned capacity or the partition maximum capacity". Will this adaptive behavior also work with on-demand capacity? – lowLatency Feb 15 '22 at 04:51
  • (I mean, will all of the on-demand capacity be used for querying the hot partition, or only partial capacity, assuming that the client is querying only keys that are in the hot partition?) – lowLatency Feb 15 '22 at 04:51
46

Updated Answer:

DynamoDB allows specification of secondary indexes to aid in this sort of query. A secondary index can either be global, meaning that the index spans the whole table across hash keys, or local, meaning that the index exists within each hash-key partition, thus requiring the hash key to also be specified when making the query.

For the use case in this question, you would want to use a global secondary index on the "CreatedAt" field.

For more on DynamoDB secondary indexes see the secondary index documentation

Original Answer:

DynamoDB does not allow indexed lookups on the range key only. The hash key is required such that the service knows which partition to look in to find the data.

You can, of course, perform a scan operation to filter by the date value; however, this would require a full table scan, so it is not ideal.
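As a sketch of what such a scan looks like in the low-level API, the request can be built as a plain parameter object (the function and table names here are illustrative; `CreatedAt` is assumed to be stored as a number, per the other answers):

```javascript
// Parameters for a full-table Scan that filters on the range
// attribute alone. Every item is still read (and billed);
// the filter is applied after the read.
function buildScanParams(tableName, createdAfterMs) {
  return {
    TableName: tableName,
    FilterExpression: "CreatedAt > :after",
    ExpressionAttributeValues: {
      // Number attributes travel as strings in the low-level API.
      ":after": { N: String(createdAfterMs) }
    }
  };
}
```

The object would be passed to something like `ddb.scan(params, callback)`; note that Scan paginates, so `LastEvaluatedKey` must be followed to cover the whole table, and the results come back unsorted.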

If you need to perform an indexed lookup of records by time across multiple primary keys, DynamoDB might not be the ideal service for you to use, or you might need to utilize a separate table (either in DynamoDB or a relational store) to store item metadata that you can perform an indexed lookup against.

AnkitG
Mike Brant
  • See the comments on the answer below; there are *not* ways to handle this now, at least not for what the OP asked. GSIs still require you to specify a hash key, so you cannot query for all records with `CreatedAt` greater than a certain point. – pkaeding Oct 22 '15 at 16:43
  • @pkaeding To perform a scan against the GSI, you must provide a hash value, this is correct. However, it is the hash value for the GSI (the key field), NOT the hash value for the underlying table. The scan API does allow using `>` filtering conditions, as would be required for this use case. – Mike Brant Oct 23 '15 at 21:07
  • @MikeBrant right, but the question asked about how to "get **all** the items that were created after a specific date and sorted by date" (emphasis mine). This is not possible, as far as I know. – pkaeding Oct 24 '15 at 05:15
  • @pkaeding You can scan the GSI using the scan API. This is really no different than performing a scan operation on the table's hash or range/hash keys itself. The difference is that you additionally specify the GSI to scan the data from. – Mike Brant Oct 26 '15 at 02:45
  • @pkaeding is right. You can get records older than some specific _date_ using __scan__, but you can't get them in sorted order. A GSI won't help you in this case. It's not possible to sort by _partition_ key, nor is it possible to query only the _range_ key. – gkiko Nov 27 '15 at 10:57
  • For those of you confused: THIS ANSWER IS WRONG. His original answer is right, but his updated answer is not. Read Warren Parad's answer below. It's correct. – Ryan Shillington May 05 '17 at 15:16
  • @MikeBrant I want to **query** (not scan, which looks at every item in the table, making it very inefficient and costly) a table on its GSI hash key (CreatedAt) using the greater-than symbol. As far as I know, this can't be done. – azizj Jun 04 '17 at 18:22
  • The problem you will probably get when using a **date** as the **primary partition** key is that you might create a hot spot on some or one of the partitions, due to the fact that in most data stores new data is queried more often than old data. – DrDirk Jul 10 '17 at 18:56
29

The approach I followed to solve this problem was to create a Global Secondary Index, as below. Not sure if this is the best approach, but hopefully it is useful to someone.

Hash Key                 | Range Key
------------------------------------
Date value of CreatedAt  | CreatedAt

A limitation is imposed on the HTTP API user: they must specify the number of days of data to retrieve, defaulting to 24 hours.

This way, I can always specify the hash key as the current date's day, and the range key can be filtered with the > and < operators while retrieving. The data is also spread across multiple shards.
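The day-bucket scheme described here might be sketched as follows, assuming ISO-8601 timestamps (the helper names are illustrative, not from the answer):

```javascript
// Split a timestamp into the day-bucket hash key and the full
// CreatedAt range key for a GSI shaped like the one above.
function toDayKey(isoTimestamp) {
  return {
    hashKey: isoTimestamp.slice(0, 10), // "YYYY-MM-DD"
    rangeKey: isoTimestamp
  };
}

// Day partitions to query for the last `days` days, newest first;
// one Query per returned key covers the requested window.
function lastDaysKeys(days, nowMs) {
  const keys = [];
  for (let i = 0; i < days; i++) {
    keys.push(new Date(nowMs - i * 86400000).toISOString().slice(0, 10));
  }
  return keys;
}
```

With a default window of 24 hours this is at most two queries (today and yesterday), each using `>` / `<` conditions on the range key.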

Gireesh
  • This is as good as it gets, except you can make smaller date slots to spread across more shards, depending on your use case. If you only need to be able to find times that have passed "now", and you know that you process items fast enough, you could for example let the hash key be the date plus the hour portion of the time, such as dividing 2021-04-17T16:22:07.000Z into a hash key of 2021-04-17T16 and a range key of 22:07.000Z, which would let you search for "past" items with a query like dateHour = "2021-04-17T16" AND minutesSeconds <= 22:07 to find all items before that time. – JHH Apr 17 '21 at 15:25
  • And if "processing past items" means setting some flag, that flag could be a prefix of the hash key, such as `_T`. Then searching for `NOTDONE_2021-04-17T16` would not include `DONE_2021-04-17T16` items. – JHH Apr 17 '21 at 15:27
  • For what it is worth, this is more or less what Amazon shows in their [example](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-time-series.html#bp-time-series-examples) of working with time series data. It only has the time (as opposed to datetime) in the range but that's minor. – Captain Man Jan 25 '22 at 22:45
13

Your hash key (primary key of sorts) has to be unique (unless you have a range key, as stated by others).

In your case, to query your table you should have a secondary index.

|  ID  | DataID | Created | Data |
|------+--------+---------+------|
| hash | xxxxx  | 1234567 | blah |

Your hash key is ID. Your secondary index is defined as DataID-Created-index (that's the name that DynamoDB will use).

Then, you can make a query like this:

var AWS = require("aws-sdk"); // assumes AWS SDK for JavaScript v2
var ddb = new AWS.DynamoDB();

var params = {
    TableName: "Table",
    IndexName: "DataID-Created-index",
    KeyConditionExpression: "DataID = :v_ID AND Created > :v_created",
    ExpressionAttributeValues: {
        ":v_ID": { S: "some_id" },
        ":v_created": { N: "1234567" } // numbers are passed as strings
    },
    // If an attribute name collides with a DynamoDB reserved word,
    // map it through ExpressionAttributeNames instead.
    ProjectionExpression: "ID, DataID, Created, Data"
};

ddb.query(params, function (err, data) {
    if (err) {
        console.log(err);
    } else {
        // A single query already returns items ordered by the range key;
        // this re-sort matters only if results from several queries are merged.
        data.Items.sort(function (a, b) {
            return parseFloat(a.Created.N) - parseFloat(b.Created.N);
        });
        // More code here
    }
});

Essentially your query looks like:

SELECT * FROM TABLE WHERE DataID = "some_id" AND Created > timestamp;

The secondary index will increase the read/write capacity units required, so you need to consider that. It is still a lot better than doing a scan, which is costly in reads and in time (each scan response is limited to 1 MB of data, so large tables take many round trips).

This may not be the best way of doing it, but for someone used to relational databases (I'm also used to SQL) it's the fastest way to get productive. Since there are no constraints on the schema, you can whip up something that works, and once you have the bandwidth to work on the most efficient approach, you can change things around.

E.T
  • You say there are no constraints, but you should know that this approach means you can store at most 10GB of data (the maximum for a single partition). – Ryan Shillington May 05 '17 at 15:14
  • This would have been the approach if DataID were known. But here we need to get every row for which Created is greater than some date. – Yasith Prabuddhaka Oct 03 '18 at 07:26
4

You could make the hash key something along the lines of a 'product category' id, then make the range key a combination of a timestamp with a unique id appended on the end. That way you know the hash key and can still query the date with greater-than.
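A sketch of that composite range key (the `#` separator and zero-padding are illustrative choices; padding keeps lexicographic order consistent with numeric time order):

```javascript
// Compose a range key of "<zero-padded msec>#<unique id>" so two items
// created in the same millisecond still get distinct keys, while plain
// string comparison still sorts by creation time.
function rangeKey(createdAtMs, uniqueId) {
  return String(createdAtMs).padStart(15, "0") + "#" + uniqueId;
}

// A "created after t" condition becomes a string comparison against the
// bare padded timestamp. Items stamped exactly t also compare greater,
// because of the "#<id>" suffix.
function createdAfter(key, tMs) {
  return key > String(tMs).padStart(15, "0");
}
```

In a query this maps to a greater-than KeyCondition on the range key within the known category partition.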

greg
1

You can have multiple identical hash keys, but only if you have a range key that varies. Think of it like file formats: you can have 2 files with the same name in the same folder as long as their formats are different. If their format is the same, their name must be different. The same concept applies to DynamoDB's hash/range keys; just think of the hash as the name and the range as the format.

Also, I don't recall if they had these at the time of the OP (I don't believe they did), but they now offer Local Secondary Indexes.

My understanding of these is that it should now allow you to perform the desired queries without having to do a full scan. The downside is that these indexes have to be specified at table creation, and also (I believe) cannot be blank when creating an item. In addition, they require additional throughput (though typically not as much as a scan) and storage, so it's not a perfect solution, but a viable alternative, for some.

I do still recommend Mike Brant's answer as the preferred method of using DynamoDB, though; and use that method myself. In my case, I just have a central table with only a hash key as my ID, then secondary tables that have a hash and range that can be queried, then the item points the code to the central table's "item of interest", directly.

Additional data regarding the secondary indexes can be found in Amazon's DynamoDB documentation here for those interested.

Anyway, hopefully this will help anyone else that happens upon this thread.

DGolberg
  • I tried creating a DynamoDB table where there was an AWSDynamoDBKeySchemaElement 'createdAt' of type hash and again an AWSDynamoDBKeySchemaElement 'createdAt' of type range, and I got an error that said Error Domain=com.amazonaws.AWSDynamoDBErrorDomain Code=0 "(null)" UserInfo={__type=com.amazon.coral.validate#ValidationException, message=Both the Hash Key and the Range Key element in the KeySchema have the same name}. So I don't think what you're saying is correct. – user1709076 Dec 03 '15 at 14:22
  • I believe you misunderstood (though I suppose I wasn't very clear in my description, either). You cannot have 2 different attributes (columns) with the same name in a table, but when you create a hash key with a range key, you can have multiple items that all use the same hash as long as their range is different, and vice versa. For example: your hash is "ID" and your range is "Date"; you could have 2 instances of the ID "1234" as long as their Date is different. – DGolberg Dec 04 '15 at 15:37
  • Ah, DGolberg! I get you now. That's great. So for my case, since I only and always will just want to query for text messages 'after date = x', it looks like I could set all text messages to have the same 'fake_hash=1', then do my query.keyConditionExpression = @"fake_hash = 1 and #Date > :val". Thank you very much. If you have any other input I'd be happy to hear it, since it does seem odd to have a hash that is always the same value. – user1709076 Dec 05 '15 at 19:03
  • I'd have to check again, but I'm pretty sure you can do a query on hash-only tables... though if you're using a date/time stamp as your hash, I'd recommend recording down to the shortest unit possible, like milliseconds or nano/microseconds (whatever the smallest unit of time the code can record), in order to reduce the chance of overlapping date/times. Additionally, you can add optimistic locking to further reduce the possibility of overlaps: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/JavaVersionSupportHLAPI.html Simply retry if there's a conflict. – DGolberg Dec 07 '15 at 05:29
-1

Working queries:

1. aws dynamodb scan --table-name tableName --region us-east-1 --filter-expression "begins_with(createdTm, :gen)" --expression-attribute-values '{":gen":{"S":"2021-04-15"}}' --select "COUNT"

2. aws dynamodb scan --table-name tableName --region us-east-1 --filter-expression "createdTm BETWEEN :v1 AND :v2" --expression-attribute-values '{":v1":{"S":"2021-04-13"}, ":v2":{"S":"2021-04-14"}}' --select "COUNT"

  • Please `edit` to add an explanation as to how/why your suggestion addresses the OP's issue, and fix the formatting issues. You can highlight code and use the toolbar, or use markdown to format inline code or code blocks, as appropriate. More information is available at stackoverflow.com/help – SherylHohman Apr 15 '21 at 21:38
-10

Updated Answer: There is no convenient way to do this using DynamoDB queries with predictable throughput. One (sub-optimal) option is to use a GSI with an artificial hash key and CreatedAt as the range key. Then query by the hash key alone and set ScanIndexForward to order the results. If you can come up with a natural hash key (say, the category of the item) then this method is a winner. On the other hand, if you keep the same hash key for all items, it will hurt your throughput, mostly once your data set grows beyond 10GB (one partition).

Original Answer: You can do this now in DynamoDB by using a GSI. Make the "CreatedAt" field a GSI key and issue queries like (GT some_date). Store the date as a number (msec since epoch) for these kinds of queries.
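Storing the date as msec since epoch, as suggested here, is a one-liner; note that in the low-level API a Number attribute is sent as a string (a small sketch, with an illustrative helper name):

```javascript
// ISO-8601 date string -> msec since epoch, in the shape the
// low-level DynamoDB API expects for a Number attribute value.
function toEpochAttr(isoDate) {
  return { N: String(Date.parse(isoDate)) };
}
```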

Details are available here: Global Secondary Indexes - Amazon DynamoDB : http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.Using

This is a very powerful feature. Be aware that the query is limited to (EQ | LE | LT | GE | GT | BEGINS_WITH | BETWEEN) Condition - Amazon DynamoDB : http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Condition.html

Sony Kadavan
  • I downvoted because, as far as I can tell, your answer is incorrect. Much like a table's primary key, you can query a GSI's hash key only with the EQ operator. If you were implying that `CreatedAt` should be the GSI's range key, then you'll need to choose a hash key - and then you're back where you started, because you'll be able to query GT on `CreatedAt` only for a specific value of the hash key. – PaF Nov 16 '14 at 09:05
  • Agreed with PaF. Using a GSI with the hash key as the creation time does not help with the question asked in the OP. – 4-8-15-16-23-42 May 27 '15 at 20:41