0

In my application I run a cron job to loop over all users (2500 user) to choose an item for every user out of 4k items, considering that: - choosing the item is based on some user info, - I need to make sure that each user take a unique item that wasn't taken by any one else, so relation is one-to-one

To achieve this I have to run this cron job and loop over the users one by one sequentially and pick up the item for each then remove it from the list (not to be chosen by next user(s)) then move to the next user

actually in my system the number of users/items is getting bigger and bigger every single day, this cron job now takes 2 hours to set items to all users.

I need to improve this, one of the things I've thought about is using Threads but I cant do that since Im using automatic scaling, so I start thinking about push Queues, so when the cron jobs run, will make a loop like this:

for(User user : users){
    getMyItem(user.getId());
}

where getMyItem will push the task to a servlet to handle it and choose the best item for this person based on his data.

Let's say I'll start doing that so what will be the best/robust solution to avoid setting an item to more than one user ?

Since Im using basic scaling and 8 instances, can't rely on static variables. one of the things that came across my mind is to create a table in the DB that accept only unique items then I insert into it the taken items so if the insertion is done successfully it means no body else took this item so i can just assign it to that person, but this will make the performance a bit lower cause I need to make write DB operation with every call (I want to avoid that)

Also I thought about MemCach, its really fast but not robust enough, if I save a Set of items into it which will accept only unique items, then if more than one thread was trying to access this Set at the same time to update it, only one thread will be able to save its data and all other threads data might be overwritten and lost.

I hope you guys can help to find a solution for this problem, thanks in advance :)

Tamer Saleh
  • 473
  • 9
  • 21

2 Answers2

1

First - I would advice against using solely memcache for such algorithm - the key thing to remember about memcache is that it is volatile and might dissapear at any time, breaking the algorithm.

From Service levels:

Note: Whether shared or dedicated, memcache is not durable storage. Keys can be evicted when the cache fills up, according to the cache's LRU policy. Changes in the cache configuration or datacenter maintenance events can also flush some or all of the cache.

And from How cached data expires:

Under rare circumstances, values can also disappear from the cache prior to expiration for reasons other than memory pressure. While memcache is resilient to server failures, memcache values are not saved to disk, so a service failure can cause values to become unavailable.

I'd suggest adding a property, let's say called assigned, to the item entities, by default unset (or set to null/None) and, when it's assigned to a user, set to the user's key or key ID. This allows you:

  • to query for unassigned items when you want to make assignments
  • to skip items recently assigned but still showing up in the query results due to eventual consistency, so no need to struggle for consistency
  • to be certain that an item can uniquely be assigned to only a single user
  • to easily find items assigned to a certain user if/when you're doing per-user processing of items, eventually setting the assigned property to a known value signifying done when its processing completes

Note: you may need a one-time migration task to update this assigned property for any existing entities when you first deploy the solution, to have these entities included in the query index, otherwise they would not show up in the query results.

As for the growing execution time of the cron jobs: just split the work into multiple fixed-size batches (as many as needed) to be performed in separate requests, typically push tasks. The usual approach for splitting is using query cursors. The cron job would only trigger enqueueing the initial batch processing task, which would then enqueue an additional such task if there are remaining batches for processing.

To get a general idea of such a solution works take a peek at Google appengine: Task queue performance (it's python, but the general idea is the same).

Dan Cornilescu
  • 39,470
  • 12
  • 57
  • 97
  • Thats great idea of adding an assigned property into the item entity so when I loop over the item list I know that its assigned, but now if lets say this list of item is saved into a static variable, how to maintain this static variable updated given that the cloud can start/evict any instance whenever it needs? – Tamer Saleh Dec 07 '17 at 17:35
  • The whole idea is to avoid using such static variable - exactly because of the difficulties it raises. and storing such list in the datastore is problematic as it's a single 'hot' entity which would have to be updated at every batch, limiting your performance. If needed, you can pass the todo list (or rather what's left unassigned in it) from one task to the next. Or obtain it fresh via queries in every batch task. This de-couples the batch executions - they don't depend on any'centralized value, be it a static variable or a datastore entity, so you can run as many as you need in parallel. – Dan Cornilescu Dec 07 '17 at 17:47
  • "If needed, you can pass the todo list (or rather what's left unassigned in it) from one task to the next" thats what I was doing before, need to make them run in parallel to be faster, "Or obtain it fresh via queries in every batch task" cant do that too its gonna impact the performance with quering with every single batch task, my target is don't do any DB interaction during running the batches and at the same time maintain the items state – Tamer Saleh Dec 08 '17 at 04:08
  • May I use memcach using its putIfUntouhced method trying to put there the assigned items? so if the item already there I'll know then will not assign it any more? I was trying to do something like mc.putIfUnTouched("item_id", null, "user_id"); but it gives me this Exception, oldValue mustn't be null. Thanks in advance :) – Tamer Saleh Dec 08 '17 at 12:07
  • Memcache offers no protection for multiple threads (over)writing a shared object in the same time. I don't see how you can get such solution going. Good luck, tho! – Dan Cornilescu Dec 08 '17 at 13:55
  • I want to apply your approach but still I have some questions about in a previous comment, thank u :) – Tamer Saleh Dec 09 '17 at 01:48
  • Can you please ask them as separate questions, eventually referencing this one for context? Answering too many questions in one, including comments clutters things. Thx. – Dan Cornilescu Dec 09 '17 at 03:26
  • they are actually questions about ur approach :), so there is no other place to put them on other than as a comment here :) – Tamer Saleh Dec 09 '17 at 05:06
0

If you are planning for push jobs inside a cron and you want the jobs to be updating key-value pairs as an addon to improvise the speed and performance, we can split the number of users and number of items into multiple key-(list of values) pairs so that our push jobs will pick the key random ( logic to write to pick a key out of 4 or 5 keys) and then remove an item from the list of items and update the key again, try to have a locking before working on the above part. Example of key value paris. Userlist1: ["vijay",...] Userlist2: ["ramana",...]

vijay
  • 609
  • 5
  • 9
  • i cant pick an item randomly, I need to run a matching logic for each user against all items then pick up the best one for him – Tamer Saleh Dec 08 '17 at 03:51