
I read this answer, which clarified a lot of things, but I'm still confused about how I should go about designing my primary key.

First off, I want to clarify the idea of WCUs. I get that a WCU is the write capacity for at most 1 KB per second. Does that mean that if writing a piece of data takes 0.25 seconds, I would need 4 of those writes to be billed 1 WCU? Or does each write consume 1 WCU, but I could also write X times within 1 second and still be billed 1 WCU?

Usage

I want to create a table that stores the form data for a set of gyms (95% will be waivers, the rest will be incident reports). Most of the time, each form will be accessed directly via its unique ID. I also want to query the forms by date, form, userId, etc.

We can assume an average of 50k forms per gym.

Options

  • The first option is straightforward: have the formId be the partition key. What I don't like about this option is that scan operations will always filter out 90% of the data (i.e., the forms from other gyms), which isn't good for RCUs.

  • The second option is to make the gymId the partition key and add a sort key for the date, formId, userId. To implement this option, I would need to know more about the implications of having 50k records on one partition key.

  • The third option is to have one table per gym and have the formId as the partition key. This seems like the best option for now, but I don't really like the idea of having a large number of tables doing the same thing in my account.

Is there another option? Which one of the three is better?

Edit: I'm assuming another option would be SimpleDB?

Charles

2 Answers


For your PK design: what data does the app have when a user is going to look for a form? Does it have the GymID, userID, and formID? If so, perhaps make a compound key out of those for the PK. So your PK might look like:

234455::53894302::245 

Where 234455 is the GymID, 53894302 is the user's ID, and 245 is the form ID. You might even move the form ID to the sort key along with a date, so you could have an SK of form::245::. Then you could easily get all items of type form for that user, all form 245s for that user, or all form 245s in 2020 for that user, by using the begins_with() expression in your Query.
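Just to illustrate (the table name, attribute names, and IDs below are made up, not something from your setup), a begins_with() Query against that kind of compound key could look roughly like this with boto3:

    import boto3
    from boto3.dynamodb.conditions import Key

    # Hypothetical table and attribute names, purely to sketch the compound-key idea.
    table = boto3.resource("dynamodb").Table("GymForms")

    # PK = "<gymId>::<userId>", SK = "form::<formId>::<date>"
    pk = "234455::53894302"

    # All form 245s in 2020 for that user at that gym.
    resp = table.query(
        KeyConditionExpression=Key("PK").eq(pk) & Key("SK").begins_with("form::245::2020")
    )
    items = resp["Items"]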

This might not be exactly what you should do, but play with it and see what options you come up with. One thing to think about is what happens when a user moves gyms? Perhaps in that rare event you rewrite all of their items in the DB with the new gymID. Perhaps you do not have the gymID in the PK. Without a lot more info, it is difficult to say. Hopefully this is enough for you to chew on so you can come up with a solution.

NoSQLKnowHow
  • BatchWriteItem would cost 4 WCU, https://stackoverflow.com/questions/54138582/aws-dynamodb-batchwriteitem-write-capacity-units – Charles May 21 '21 at 19:21
  • The issue is that the only valid operation for the Partition key is '='. Hence, the need for option #2 – Charles May 21 '21 at 19:55
  • Yep, you are correct about the WCUs. I was thinking of something else. – NoSQLKnowHow May 21 '21 at 20:26
  • And I get that the only valid operation for the PK is =, but that doesn't mean you cannot still put compound information in that key. If the app knows the gym and the user IDs, it can easily concatenate them so you have the PK and thus the item collection. – NoSQLKnowHow May 21 '21 at 20:27
  • Some requests may not include the userId, i.e. getting all the forms filled today at a particular gym – Charles May 21 '21 at 20:40

Every call that writes to DDB consumes at least 1 (standard) or 2 (transactional) WCUs, assuming your items are less than 1 KB in size.

See the Provisioned Throughput key point:

Item sizes for writes are rounded up to the next 1 KB multiple. For example, writing a 500-byte item consumes the same throughput as writing a 1 KB item.

So writing 4 items in one second will require 4 WCU. But "burst" mode means you might temporarily be able to write 4 items a second for a short period of time in a table that's only provisioned for 2 WCU.
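To make the rounding and the transactional doubling concrete, here's a rough back-of-the-envelope helper (my own sketch, not an AWS API):

    import math

    def estimate_wcu(item_size_bytes: int, transactional: bool = False) -> int:
        """Rough estimate: size rounded up to the next 1 KB, doubled for transactional writes."""
        units = max(math.ceil(item_size_bytes / 1024), 1)  # 500 bytes -> 1, 1.5 KB -> 2
        return units * (2 if transactional else 1)

    estimate_wcu(500)          # 1 WCU (rounded up to 1 KB)
    estimate_wcu(500, True)    # 2 WCUs for a transactional write
    estimate_wcu(2500)         # 3 WCUs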

As far as your proposed options go: it depends. You mentioned some general access patterns, but not specifics, nor whether those are the only ones you need.

In an RDBMS, you have to know ahead of time how you want to store the data. But accessing that data is very flexible.

In DDB, you have to know how you need to access the data, but the storage structure is flexible.

Some general feedback:

  • Scan() is an operation of last resort; it should be used very, very infrequently, if at all.
  • 50K records for a partition key isn't a big deal. What matters (but less than it used to) is how distributed your accesses are to each partition key. Ideally, you want a uniform distribution of access across all your partition keys.
  • One table per gym is a valid multi-tenant strategy, but there are management/overhead costs.

Assuming you actually have multi-tenants, i.e. each gym is an individual customer, then I'd lean toward having gymId be the hash key so I could take advantage of enforcing tenant isolation via IAM roles, as outlined in this article.
Caveat: this could be problematic if tenants are NOT of approximately the same size. But again, that's less of a problem than it originally was.
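For reference, that tenant-isolation approach relies on the dynamodb:LeadingKeys IAM condition key. A rough sketch of such a policy (the table ARN, account ID, and gymId value below are placeholders), written out here as a Python dict:

    # Sketch of a per-tenant policy restricting access to one gym's partition keys.
    # In practice the gymId would typically come from the caller's identity
    # (e.g. a session tag or Cognito identity) rather than being hard-coded.
    tenant_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:PutItem"],
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/GymForms",
            "Condition": {
                "ForAllValues:StringEquals": {
                    "dynamodb:LeadingKeys": ["234455"]  # only items whose hash key is this gymId
                }
            }
        }]
    }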

Charles
  • I can't assume that all gyms will have the same size. If a gym is bigger, it will have more data and thus more traffic. – Charles May 21 '21 at 19:58
  • Again, not the problem it used to be. Plus, unless you're looking at 10GB+ of data or more than 3000RCU/1000WCU ... your data is likely to end up in a single physical partition anyway. – Charles May 21 '21 at 20:07
  • Ok then I guess I'll go with option 2 for now. I just need to make some tests with provisioned capacities so I can try to predict how it'll behave in production. – Charles May 21 '21 at 20:20
  • when you say "3000RCU/1000WCU" is that per second? Or is that consumed units in total – Charles May 21 '21 at 20:28