
I have an index used for bulk operations on collections that is experiencing throttling. To mitigate this I am planning to shard the index so that each partition key is split over some number of partitions. At the moment there is a delete operation running against the base table using the index: we query a set number of items for a partition key in the index, delete them, then repeat until finished.
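Roughly, the current loop looks like this (a sketch with hypothetical names: a table "mytable" with a GSI "index" keyed on GSIPK, and base-table keys pk/sk):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("mytable")

while True:
    # Query a fixed number of items for one partition key on the GSI.
    # GSIs always project the base table keys, so the result contains
    # everything needed for the deletes.
    page = table.query(
        IndexName="index",
        KeyConditionExpression=Key("GSIPK").eq("somePk"),
        Limit=100,
    )
    if not page["Items"]:
        break

    # batch_writer splits the deletes into BatchWriteItem calls of 25.
    # The GSI is eventually consistent, so a page may briefly include
    # already-deleted items; re-deleting them is harmless.
    with table.batch_writer() as batch:
        for item in page["Items"]:
            batch.delete_item(Key={"pk": item["pk"], "sk": item["sk"]})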

The problem I see here is that if I do something similar with the sharded partition keys, I will just end up iterating through each partition in turn and hit the same throttling on the base table when deleting. Is there a way to issue a bulk query in DynamoDB so that I can, for example, check all shards and retrieve a set with an even distribution of items across them?

Person1

2 Answers


It's important to understand the cause and magnitude of your GSI throttling. Is it write or read throttling you are experiencing? Is your GSI partition key of low cardinality?

Assuming writes are the issue, you only need to shard the GSI keys which consume more than 1000 WCU, the per-partition write limit. So if your expected throughput requires 4000 WCU, you only need to shard 4-5 ways. You can then use the PartiQL API to run a "batch query" that retrieves the items from all shards in a single call:

SELECT * FROM "mytable"."index" WHERE GSIPK IN ['a-1','a-2','a-3','a-4']
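With boto3, that single call might look like the following sketch (same hypothetical table, index, and key names as above; NextToken pagination omitted for brevity):

import boto3

client = boto3.client("dynamodb")

# execute_statement runs the PartiQL query; the result interleaves
# items from all four shards in one response.
response = client.execute_statement(
    Statement='SELECT * FROM "mytable"."index" WHERE GSIPK IN [?, ?, ?, ?]',
    Parameters=[{"S": s} for s in ("a-1", "a-2", "a-3", "a-4")],
)
for item in response["Items"]:
    print(item)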

This article contains more info on sharding item collections in DynamoDB:

https://medium.com/@leeroy.hannigan/optimizing-dynamodb-queries-using-key-sharding-f3eb4d7f78f7
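For completeness, the write side of this scheme could be sketched as follows (hypothetical names matching the query above; each write picks a random shard suffix so writes for one logical key spread across all shards):

import random
import boto3

NUM_SHARDS = 4
table = boto3.resource("dynamodb").Table("mytable")

def put_sharded(pk, item):
    # Append a random shard suffix, e.g. "a" becomes "a-1" .. "a-4".
    item["GSIPK"] = f"{pk}-{random.randint(1, NUM_SHARDS)}"
    table.put_item(Item=item)

put_sharded("a", {"pk": "a", "sk": "item-1"})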

Leeroy Hannigan

Are you talking about global secondary indexes? If so, they have their own capacity, and splitting the index into multiple indexes will certainly have a positive impact.

That aside, are you able to use TTL instead of querying and deleting items? TTL is free, runs in the background, and causes no throttling whatsoever (a sketch of enabling it follows the quote below).

From the docs:

TTL is useful if you store items that lose relevance after a specific time. The following are example TTL use cases:

Remove user or sensor data after one year of inactivity in an application.

Archive expired items to an Amazon S3 data lake via Amazon DynamoDB Streams and AWS Lambda.

Retain sensitive data for a certain amount of time according to contractual or regulatory obligations.
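Enabling TTL is a one-time call plus an epoch timestamp attribute on each item. A minimal sketch, assuming a table "mytable" and a hypothetical attribute name "expiresAt":

import time
import boto3

client = boto3.client("dynamodb")

# Tell DynamoDB which attribute holds the expiry time (epoch seconds).
client.update_time_to_live(
    TableName="mytable",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expiresAt"},
)

# Items whose expiresAt has passed are deleted in the background for free.
table = boto3.resource("dynamodb").Table("mytable")
table.put_item(
    Item={
        "pk": "somePk",
        "sk": "item-1",
        "expiresAt": int(time.time()) + 86400,  # ~24 hours from now
    }
)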

Borislav Stoilov
  • Right, but in order to benefit from the sharding I will need to run multiple queries. So instead of querying 100 items from somePk and then running a batched delete on the resulting item keys, I would need to, say, query 50 from somePk#shard1 and 50 from somePk#shard2? There's no way to combine the queries into one? As for TTL, I can't use it because the items need to stick around until specifically deleted, and I also intend to use the index for other bulk write operations besides deletions at some point. – Person1 Nov 10 '22 at 11:42
  • No, you can't combine them, as every query needs a concrete partition key to be specified. The limit for batch write operations is 25 items anyway, so you will still have to perform 4 batch deletes no matter how you split the 100 elements. – Borislav Stoilov Nov 10 '22 at 12:06