
Folks, what would you suggest the DynamoDB table structure be for the following object? There will be roughly 2 million objects, which will need to be searchable by email and/or organization.

{
  email: 'foo@bar.com',
  organization: 'foobar'
}

What would you make the Hash/Range Keys be? I need to be able to perform the following operations:

  • Retrieve all emails for specific organization
  • Delete specific email

Should I add a random id parameter to the Table? I would imagine the following is the correct way:

  • organization being the Hash Key, email being the Range Key.

Thanks

Cmag

3 Answers


It seems that either of those would distribute your objects well as hash keys, so I don't know that either of them is necessarily a better hash key per se. I think that the fact that you'll need to retrieve all of the specific emails for an organization makes that the better candidate for a hash key, though. You can just do a query using the organization to get all of an organization's emails.

Note that in order to support the use cases you described, you'll need a global secondary index. This answer may be helpful in showing why, but assuming that you went with Organization as the table hash key, you'd need a global secondary index on email to retrieve a specific email (or retrieve that item to delete it).
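
For concreteness, here's a rough sketch of that layout using boto3 (the table and index names are just placeholders, not anything from the question): organization as the hash key, email as the range key, and a global secondary index on email for single-address lookups.

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")

# Hypothetical table: organization is the partition (hash) key, email the sort (range) key.
table = dynamodb.create_table(
    TableName="OrgEmails",
    AttributeDefinitions=[
        {"AttributeName": "organization", "AttributeType": "S"},
        {"AttributeName": "email", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "organization", "KeyType": "HASH"},
        {"AttributeName": "email", "KeyType": "RANGE"},
    ],
    GlobalSecondaryIndexes=[
        {
            # Hypothetical GSI so a single email can be found without knowing its organization.
            "IndexName": "email-index",
            "KeySchema": [{"AttributeName": "email", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "KEYS_ONLY"},
            "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
        }
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
table.wait_until_exists()

# All emails for one organization: a Query on the table hash key, no Scan needed.
resp = table.query(KeyConditionExpression=Key("organization").eq("foobar"))
emails = [item["email"] for item in resp["Items"]]
```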

rpmartz
  • So if `organization` is the hashKey and `email` is rangeKey, then I would have to do table scans to get all Organizations, and queries to get emails for a particular organization. correct? – Cmag Apr 22 '14 at 19:44
  • Therefore, if I have 1 million records, 86 bytes each, that's 86,000 KB, which would mean I would need 21,500 provisioned reads!? – Cmag Apr 22 '14 at 19:53
  • Would it be better to have the hashKey be a known value of say `1` then you can have rangeKey being your `organization` and secondary index being your `email` or is this a hack? – Cmag Apr 22 '14 at 19:57

The problem is the provisioned capacity and Scan operations. If you have 1 million records, 86 bytes each, that amounts to 86,000 KB, which would require roughly 21,500 provisioned reads!
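
Roughly, the arithmetic behind that figure (assuming 4 KB per strongly consistent read capacity unit, and reading the whole table in one second):

```python
records = 1_000_000
item_size_bytes = 86
total_kb = records * item_size_bytes / 1000   # 86,000 KB
read_units = total_kb / 4                     # 21,500 read capacity units for a one-second full read
print(read_units)
```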

At this point, to keep the costs down, I see no other alternative than to have the following structure:

| Hash Key | Range Key    | Secondary Range Key |
|----------|--------------|---------------------|
| 1        | organization | email               |

in other words:

| Hash Key | Range Key    | Secondary Range Key |
|----------|--------------|---------------------|
| 1        | foo          | asdf@foo.com        |
| 1        | bar          | asdf@bar.com        |
| 1        | foo          | fdsa@foo.com        |

This means you always know your hash key, and using it you can run queries against specific range keys.
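
For illustration, a query against that layout might look like this in boto3 (the table name and the `pk` attribute name are hypothetical):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("OrgEmails")  # hypothetical table name

# Hash key is the known constant "1"; the range key condition picks the organization.
resp = table.query(
    KeyConditionExpression=Key("pk").eq("1") & Key("organization").eq("foo")
)
items = resp["Items"]
```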

Thoughts?

Cmag
  • Amazon is pretty explicit that your table hash key should distribute your items relatively uniformly, so I don't think using the same key for all the items is a great idea. I'd recommend figuring out what queries your application needs to support and then designing your table with local and global secondary indexes appropriately. In some cases it may be inevitable, but you don't want to be doing `Scan`s if you can avoid it. – rpmartz Apr 22 '14 at 20:14
  • @RyanM using my example I am avoiding doing any Scans, only Queries. Yes, the reads are not distributed... but that seems to be the only option: either `scan` or have a single known hashkey – Cmag Apr 22 '14 at 20:18
  • If you're avoiding doing scans, you don't need to read all 1 million records as you suggested in your comment to my answer. – rpmartz Apr 22 '14 at 21:10

In your base table, use email as the hash key: it is more random than organization, so it can be partitioned well.

Create a GSI with organization as the hash key.

1) Retrieve all emails for a specific organization

Query your GSI with the hash key equal to the target organization.

2) Delete a specific email

Easily done, because email is the hash key of your base table.
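
Putting both operations together, a rough boto3 sketch (table and index names are hypothetical):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Emails")  # hypothetical table name

# 1) All emails for a specific organization: Query the GSI keyed on organization.
resp = table.query(
    IndexName="org-index",                            # hypothetical GSI name
    KeyConditionExpression=Key("organization").eq("foobar"),
)
emails = [item["email"] for item in resp["Items"]]

# 2) Delete a specific email: a direct key lookup on the base table.
table.delete_item(Key={"email": "foo@bar.com"})
```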

A low provisioned throughput will still work; the only effect is that your Scan will take longer. If your provisioned read throughput is 10, then a full scan will take about:

21,500 / 10 = 2,150 seconds.

For the Scan operation you can set a limit on how many items it should return. The result will also include a LastEvaluatedKey, which you can pass to your next Scan call to get the following page.
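
For example, a paginated Scan along those lines might look like this in boto3 (table name is hypothetical):

```python
import boto3

table = boto3.resource("dynamodb").Table("Emails")  # hypothetical table name

scan_kwargs = {"Limit": 100}                          # cap the items returned per page
while True:
    page = table.scan(**scan_kwargs)
    for item in page["Items"]:
        print(item)                                   # process each item
    if "LastEvaluatedKey" not in page:
        break                                         # no more pages
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```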

Erben Mo