I'm exploring the use of DynamoDB in the application I work on, which currently only has one database component -- a MySQL database running on RDS.

We use AWS pretty heavily, along with a sharding scheme for our databases, but sharding can only get us so far without manual intervention. Playing around with Aurora, I actually saw a significant drop in performance versus our MySQL database, so I'm evaluating DynamoDB to see if it will work for us: it can efficiently store JSON data, and it scales easily (just increase the reads or writes per second in the AWS console and let Amazon do the heavy lifting).
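(For what it's worth, that capacity adjustment can also be scripted. A rough sketch using the Node.js SDK -- the table name, region, and capacity numbers here are just placeholders:)

// Sketch: raise provisioned throughput on an existing table.
// Assumes aws-sdk v2 with credentials from the environment;
// table name, region, and capacity values are made up.
var AWS = require('aws-sdk');
var dynamodb = new AWS.DynamoDB({ region: 'us-east-1' });

dynamodb.updateTable({
    TableName: 'Cache',
    ProvisionedThroughput: {
        ReadCapacityUnits: 500,   // reads and writes scale independently
        WriteCapacityUnits: 200
    }
}, function (err, data) {
    if (err) console.error(err);
    else console.log('Table status:', data.TableDescription.TableStatus);
});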

In several of our MySQL tables the primary key is an autoincrement column, but we also have several indices on top of that to support other query patterns. Those indices are crucial, as some of our tables have over 1 billion rows. In essence, we partition things by client, object_name, etc. So I might do something like this in MySQL:

CREATE TABLE `record` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `client_id` int(10) unsigned NOT NULL,
  `data_id_start` bigint(20) unsigned NOT NULL,
  `data_id_end` bigint(20) unsigned NOT NULL DEFAULT '8888888888888888',
  `object_name` varchar(255) NOT NULL,
  `uuid` varchar(255) NOT NULL,
  `deleted` tinyint(1) unsigned NOT NULL DEFAULT '0',
  ...
  PRIMARY KEY (`id`),
  ...
  KEY `client_id_object_name_data_id_data_id_end_deleted` (`client_id`,`object_name`,`data_id_start`,`data_id_end`,`deleted`),
  KEY `client_id_object_name_data_id_end_uuid_id` (`client_id`,`object_name`,`data_id_end`,`uuid_id`),
  ...
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

I'm evaluating duplicating some of this data in DynamoDB to use as a cache, so that in certain situations we don't have to go out to S3 to retrieve data stored there. Instead, I'll store the JSON data directly in the cache. In DynamoDB, it looks like I can use either a HASH key alone, or a HASH and a RANGE attribute together. So, for example, I could use the autoincrement column from our MySQL table as the HASH, but all of the examples I see of RANGE keys, global/local secondary indexes, etc. specify only ONE other attribute as the RANGE. I want to create an index for efficient lookup when 3 or more values are specified in the "where" clause.

For example, I would like to query this table using an expression like this:

var params = {
    TableName: "Cache",
    KeyConditionExpression: "clientId = :clientId and objectName = :objectName and uuid = :uuid",
    ExpressionAttributeValues: {
        ":clientId": 17,
        ":objectName": "Some name",
        ":uuid": "ABC123-KDJFK3244-CCB"
    }
};

Notice that my "where clause" in the KeyConditionExpression uses 3 values. It's possible that we might have 4 or 5 values there. So is there any way in DynamoDB to create composite keys that have more than 2 attributes (columns) in them?

If not, I suppose I could concatenate the 3 columns into a string and use that as my primary key on each insert. Or at least concatenate clientId and objectName, then use uuid as a RANGE key or something like that. Effectively, I need to page through all values for a specific clientId/objectName combination and then, based on some of the attributes in each row, either take the value directly from the cache or consider it a miss and retrieve the value from S3 (which is considerably slower).
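Roughly, that concatenation idea would look like this (just a sketch; the clientObject key name and the "#" separator are mine):

// Sketch: partition key clientObject = clientId + "#" + objectName,
// with uuid as the RANGE key. Pages through one clientId/objectName combo.
var AWS = require('aws-sdk');
var docClient = new AWS.DynamoDB.DocumentClient();

var params = {
    TableName: "Cache",
    KeyConditionExpression: "clientObject = :co",
    ExpressionAttributeValues: {
        ":co": "17#Some name"
    },
    Limit: 100   // page size
};

function onPage(err, data) {
    if (err) return console.error(err);
    data.Items.forEach(function (item) {
        // per item: use the cached JSON, or treat it as a miss and go to S3
    });
    if (data.LastEvaluatedKey) {   // more pages remain
        params.ExclusiveStartKey = data.LastEvaluatedKey;
        docClient.query(params, onPage);
    }
}
docClient.query(params, onPage);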

twofifty6

1 Answer

DynamoDB can serve consistent, low-latency queries over an essentially unlimited amount of data, which fits this use case. The model you suggested, concatenating the values, seems like a good approach.

One thing to note is that hash key attribute values are limited to 2048 bytes. If the values you are concatenating are not of predictable length (so you can't pad them nicely) or could exceed this limit, a better approach may be to hash the concatenated value and look items up by that hash. Here is the relevant documentation on limits: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html. Note also that DynamoDB items are limited to 400KB of total data.
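For example (a sketch, not tied to your exact schema -- the separator and function name are illustrative):

// Sketch: derive a fixed-length hash key from the concatenated values,
// so the 2048-byte limit can never be exceeded. Uses Node's built-in crypto.
var crypto = require('crypto');

function cacheHashKey(clientId, objectName, uuid) {
    // a separator avoids ambiguity between adjacent values ("ab"+"c" vs "a"+"bc")
    var composite = [clientId, objectName, uuid].join('#');
    return crypto.createHash('sha256').update(composite).digest('hex'); // 64 hex chars
}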

For correctness, I would also use some unique identifier as a range key; this tolerates hash value collisions (even if they're rare), and the schema scales because there is only a small number of items per hash key value.
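An item would then look something like this (reusing cacheHashKey from the sketch above; the attribute names are illustrative):

// Sketch: digest as the HASH key, uuid as the RANGE key. If two different
// composites ever hashed to the same digest, the range key keeps them distinct.
var AWS = require('aws-sdk');
var docClient = new AWS.DynamoDB.DocumentClient();

docClient.put({
    TableName: "Cache",
    Item: {
        cacheKey: cacheHashKey(17, "Some name", "ABC123-KDJFK3244-CCB"),
        uuid: "ABC123-KDJFK3244-CCB",
        payload: "{ \"cached\": \"json\" }"
    }
}, function (err) {
    if (err) console.error(err);
});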

Ben Schwartz
  • Hmm...if I concatenate the client and object it will actually cause quite a few collisions. Some of our customers have 24 million objects of the same type. That's a LOT of collisions if that is the hash value. Then the unique ID of each record could be a RANGE, but again -- that is still a very large number of potential collisions. But can you answer my question whether I can create an index with more than 2 columns? – twofifty6 Aug 31 '15 at 20:58
  • You cannot create an index with more than 2 columns. This restriction allows DynamoDB to provide consistent low latency writes/reads at any scale if the data model fits. Can you provide more information about why there would be so many collisions? If each client ID is unique and each client has unique object names, this will provide a great uniformly distributed hash key value. – Ben Schwartz Aug 31 '15 at 21:02
  • Ok thanks. 2 columns jibes with what I see everywhere. Again, our clients can have millions of objects of the same type. So concatenating clientId + objectName would mean millions of records all have the same hash. The uuid could be the RANGE attribute, which is unique, but is it a problem that 24M+ records could have the same hash value, with the uuid breaking the tie? If this were relational I'd be doing something like "SELECT * FROM cache WHERE client_id=? AND object_name=? LIMIT X OFFSET Y" and paging through all records. – twofifty6 Aug 31 '15 at 21:16
  • You may run into hot-key issues if a single hash key is read/write accessed more frequently than others (so having 24M+ records with the same hash could be very problematic). So there are duplicate objectName values for the same clientId? The best way to do this would be to include some unique value along with the clientId and objectName in the hash key value used for lookup. Since a hash key value can be 2048 bytes, there are 2^(2048*8) ~ 10^4932 possible values for a hash key. – Ben Schwartz Aug 31 '15 at 21:21
  • Yes. This is Salesforce data, so a single client might have millions of Account objects, for example. A select might look something like: mysql> select client_id, object_name, uuid from record; | 1 | Account | AB1 | | 1 | Account | AB2 | | 1 | Account | AB3 | ... Maybe I should use another MySQL DB and index on (client_id, object_name), but I don't see how that's vastly different from hash collisions in Dynamo. – twofifty6 Aug 31 '15 at 21:29
  • In DynamoDB your provisioned throughput is divided among partitions. Items with the same hash key are stored on the same partitions (sometimes a hash key can span multiple partitions if the partition is split for size). Would it be possible to use hash(clientId + objectName + uniqueItemIdentifier) for the hash key value? – Ben Schwartz Aug 31 '15 at 21:33
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/88417/discussion-between-twofifty6-and-ben-schwartz). – twofifty6 Aug 31 '15 at 21:37