2

I was looking at DynamoDB to store some data because it looked like it might be a cost effective solution, but after a bit of research I think that it might not fit my use case because I am unable to find relevant unique values for partition and sort keys.

My data are a series of records of natural events for various species of plants e.g. the date and location that someone noticed a Beech tree's leaves appearing.

{
  "species": "Beech",
  "event": "Budburst",
  "year": 2015,
  "season": "Spring",
  "date": "12/04/2015",
  "latitude": "0.00000",
  "longitude": "40.000"
}

The main query for the application would be to obtain all of the data for a certain species for a certain event in a certain year:

Endpoint: events/:species/:event-type/:year

This is likely to return a few thousand events which can then be shown in a map.

If this were MongoDB, then I might create an index on a composite field of species+eventType+year. It wouldn't be a unique index, but at least only the few thousand results would be scanned rather than the whole table, so it wouldn't be too bad.

I'm not sure how to achieve the same thing in DynamoDB, though, or even if it is possible, because the partition key, or the partition + sort key combination seems to have to be unique.

Is the only way to make this work to have an incrementing unique event id for the partition key, and then have the species+eventType+year string as the sort key?

If there are any other patterns I'd be grateful to hear about them.

Thanks for reading.

Joe
  • 4,852
  • 10
  • 63
  • 82

2 Answers2

1

It sounds like a natural primary key would be species as the hash key and eventType+timeStamp as the sort key. (Use ISO-8601 for the timestamp so that you can query using the begins_with function in your KeyConditionExpression.)

If it's possible that there's more than one event for a given species and event type at the same time, or if you simply lack precise time stamps for the events, then you can use a UUID as the hash key, and create a GSI with species as the hash key and eventType+year, or even species+eventType+year as the hash key, since primary keys do not have to be unique in a GSI.

Also, here is a helpful related question which asks, "How to query DynamoDB by date (range key), with no obvious hash key?"

Matthew Pope
  • 7,212
  • 1
  • 28
  • 49
  • Thanks Matthew. The data will be queried very much by year rather than any more precise metric. But as you suggest I'll use the species+eventType+year + a UUID. Thank also for the link to the other SO question. Interesting to read what he says about using timestamp as the partition may well lean to a hot partition. Good to know. – Joe Aug 05 '18 at 19:45
  • @Joe I recommended the use of the full timestamp because it would mean that you can have a unique primary key that you can query by year without having to generate a uuid or use a GSI. The only caveat is that you need to ensure that events don't occur simultaneously at whatever time resolution you use. – Matthew Pope Aug 05 '18 at 20:48
  • Thanks Matthew. Unfortunately I don't have a full timestamp. There isn't that level of specificity in the data, or any guarantee that the timestamps would be unique if there were. – Joe Aug 06 '18 at 09:54
1

You could do something like this:

{
  "species+event+year": "BeechBudhurst2015",
  "eventId": 1111-2222-3333-4444
  "species": "Beech",
  "event": "Budburst",
  "year": 2015,
  "season": "Spring",
  "date": "12/04/2015",
  "latitude": "0.00000",
  "longitude": "40.000"
}

Create a UUID for each event. This is good practice anyway, there should always be something you can uniquely identify an event with.

As you've already identified, create a composite attribute of species+event+year.

Make species+event+year your partition key and eventId (the UUID) your range key.

When you do a Query, just provide the partition key, which will give you all species with a particular event in a certain year.

If you wanted to use Get item to retrieve an individual event, you would need to specify both the partition and range key.

This design is highly optimised for getting species+event+year. If there are other queries you want to optimise, you might consider having the primary partition key of the eventId - this would be a more common design I think. Then create a GSI for each optimised query (e.g. GSI partition key species+event+year). Note that GSI partition keys do not need to be unique, so there would be no need to set a range key to make each item unique. The downside to using GSIs is that you have have to provision them separately (i.e. it costs you more money).

F_SO_K
  • 13,640
  • 5
  • 54
  • 83
  • Great. Thanks @Stu. I didn't know that GSI partition keys don't need to be unique. That's very helpful. For now I will try out using the species+event+year as the partition key as I really only need that one query at the moment and I will no doubt learn quite a bit just by doing it. – Joe Aug 05 '18 at 19:42