5

I'm building an application based around a task queue: it serves a series of tasks to multiple, asynchronously connected clients. The twist is that the tasks must be served in a random order.

My problem is that the algorithm I'm using now is computationally expensive, because it relies on many large queries and transfers from the database. I have a strong hunch that there's a cheaper way to achieve the same result, but I can't quite see the solution. Can you think of a clever fix for this problem?

Here's the (computationally expensive) algorithm I'm using now:

When the client queries for a new task...

  1. Query the database for "unfinished" tasks
  2. Put all tasks in a list
  3. Shuffle the list (using random.shuffle)
  4. Flag the first task as "in progress"
  5. Send the task parameters to the client for completion

When the client finishes the task...

6a. Record the result and flag the task as "finished."

If the client fails to finish the task by some deadline...

6b. Re-flag the task as "unfinished."

Seems like we could do better by replacing steps 1, 2, and 3, with pseudorandom sequences or hash functions. But I can't quite figure out the whole solution. Ideas?

Other considerations:

  • In case it's important, I'm using python and mongodb for all of this. (Mongodb doesn't have some clever "use find_one to efficiently return a random matching entry" usage, does it?)
  • The term "queue" is a little misleading. All the tasks are stored in subfields of a single collection within the mongodb. The length (total number of tasks) in the collection is known and fixed at the outset.
  • If it's necessary, it might be okay to let the same task be assigned multiple times, as long as the occurrence is rare. But instances of this kind would need to be very rare, because completing each task is costly.
  • I have identifying information on each client, so we know exactly who originates each task request.
Abe
  • 22,738
  • 26
  • 82
  • 111
  • If it's a list, why not generate a random number between 0 and list size and use this as a random index? This way you don't shuffle. Also, I am not sure about mongodb on this, but maybe you can even query for only a single element in one of two ways: If mongodb supports it, let it choose one randomly. If it doesn't, get all identifiers in a list and choose one randomly (maybe you can cache that to ease db stress?). – javex Jul 26 '12 at 14:31
  • This works for the first task, but on the second task, there's a chance that the random number will be the same as the one picked previously. As more and more tasks are completed, the probability of redundant picks goes up and up. – Abe Jul 26 '12 at 14:34
  • Does the database assign each task with a unique numerical value, where every value must be assigned (somewhat like an array)? E.g. a list with length 5 will have indexes 0-4 (or 1-5 if your array is 1-based) – Alyssa Haroldsen Jul 26 '12 at 14:35
  • I don't know what exactly you are referring to, but can't you limit the query to only those finished? And from any list you keep, delete the reference to the one finished (e.g. if you create a cache of IDs, remove the one you completed, can be done easily with lists). – javex Jul 26 '12 at 14:37

2 Answers2

1

There is an easy way to get a random document from MongoDB!

See Random record from MongoDB

If you don't want a task to be picked twice, you could mark the task as active and not select it.

Community
  • 1
  • 1
iblue
  • 29,609
  • 19
  • 89
  • 128
  • Ooh ooh! Does it work on *sub-elements* of documents? In my mongodb schema, I have a "queues" collection; each queue is a document; each task is a subdocument. – Abe Jul 26 '12 at 14:35
  • Some of the answers might. But rereading your post, a question arises: Why exactly must the tasks be served in random order? – iblue Jul 26 '12 at 14:40
  • Good question. It's a security concern: not all of my clients are trustworthy. In some cases, tasks are validated by comparing related tasks from different clients. Serving tasks in random order makes it very hard for different clients to collude in returning invalid results. – Abe Jul 26 '12 at 15:00
  • This solution isn't perfect, as you can see from reading the comments. It looks like the MongoDB team might eventually implement query functionality to random matches. – Abe Aug 29 '12 at 16:55
0

Ah, based on the comments that I missed, you can do something along these lines:

import random
available = range(lengthofdatabase)
inprogress = []
while len(available) > 0:
    taskindex = available.pop(random.randrange(0, len(available)))
    # I'm not sure of your implementation, but you said something 
    # along these lines was possible
    task = GetTask(taskindex)
    inprogress.append(taskindex)

I'm not sure of any of the functions you are using - this is just an algorithm.

Happy Coding!

Alyssa Haroldsen
  • 3,652
  • 1
  • 20
  • 35