Background
I am currently trying to figure out the best way of calculating some stats in lambda function based on the db design I have. Let’s say I have records of users from China which has 23 provinces that I stored in an array, to which I want to determine the total number of both females and males, as well as the number of users in each province.
Given a GSI table with 200,000 items with a total size of 100bytes per item as seen below, with the province attribute being the partition key.
{
"createdAt": {
"S": "2020-08-05T19:21:07.532Z"
},
"gender": {
"S": "Male"
},
"updatedAt": {
"S": "2020-08-05T19:21:07.532Z"
},
"province": {
"S": "Heilongjiang"
}
}
I am considering using two methods for this calculation:
1.Query method
I plan on looping over the province array and providing a partition key on each loop to the query method which would end up making too many requests(23 to be precise, that’s if the each request returned doesn’t pass the limit of 1MB which might lead me to keep repeating until there is no more lastEvaluationKey for the current query).
2.Scan method
In this method, I would make requests iteratively to the database until there is no more lastEvaluationKey.
Having the knowledge of both scan and query methods being able to return only 1mb of data, which method would be the most appropriate to use in this particular use case?
I am considering going for the scan method seeing as I would need to read all the data in table in order to calculate the stats anyways; however, I am afraid of how slow the operation will become when the table grows.
PS: Suggestions for a different keySchema for better access would also be very appreciated.