
Background:

I have 8 million independent documents in a database to process. The documents have no dependencies on each other, which means the processing can be parallelized. After a document is processed, the result is saved back to the database.

There are 6 machines for me to utilize.

Current solution

Documents are stored in a single table in MySQL.

I am now partitioning the rows equally into 6 shares, each for one machine to process.

Drawbacks of current solution

Some partitions might take longer to process than others, leaving some machines busy while the rest sit idle.

Requirement

  1. I want to find a way/framework to load-balance the tasks efficiently.
  2. I am using Python as the data processing tool, so hopefully there are tools that fit well with Python.
xiaohan2012
  • https://code.google.com/p/distributed-python-for-scripting/downloads/list this one helped me a lot, and is very easy to use – usethedeathstar Jun 18 '14 at 07:40
  • How large are your documents and how fast do you expect each one to be processed? – David Marek Jun 18 '14 at 07:43
  • try doing a round robin streamed distribution instead of splitting the data into 6 at the start – omu_negru Jun 18 '14 at 07:55
  • What you want is called a simple [job queue](http://en.wikipedia.org/wiki/Job_queue). You could try [RQ](http://python-rq.org/), which is really easy if you've got a redis installation somewhere. If you want something more complex, have a look at [Celery](http://www.celeryproject.org/). – Carsten Jun 18 '14 at 07:55
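For reference, the job-queue route from the last comment can be quite small. Here is a minimal sketch using RQ, assuming a reachable Redis instance and a process_document(doc_id) function in a module of your own (mytasks and redis-host are placeholders, not names from the comments):

# enqueue.py -- push one job per document id onto a Redis-backed queue
from redis import Redis
from rq import Queue

from mytasks import process_document   # your own module with the processing function

q = Queue(connection=Redis(host="redis-host"))
for doc_id in range(1, 8000001):        # assuming ids 1..8,000,000; in practice select them from MySQL
    q.enqueue(process_document, doc_id)

Each of the six machines would then run one or more rq worker processes pointed at the same Redis (e.g. something like rq worker --url redis://redis-host:6379), pulling jobs as fast as they can finish them, which gives you the load balancing for free.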

1 Answer


Instead of partitioning the rows into equal shares, you should add a progress field to your document table (you should really create a progress table and do an outer join, but you'll get the idea). Then in each worker you start by booking a block of documents (assuming id is the primary key and WORKERID is a unique id for each worker):

update documents set progress = WORKERID
where progress is null
limit 1000

(if you're using an old version of MySQL you might need more convoluted syntax: update multiple rows using limit in mysql?)

Make sure your limit is large enough that your database gets some breathing room, but not so large that at the end you are stuck waiting for one worker to grind through a huge final block :-)
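In Python, the booking step could look roughly like this. This is a sketch assuming the MySQLdb (mysqlclient) driver; the connection details and worker id are placeholders, and cn, c and my_worker_id are the same names used in the loop below:

import MySQLdb   # the MySQLdb / mysqlclient driver

my_worker_id = "worker-3"   # a unique id per worker process
cn = MySQLdb.connect(host="dbhost", user="worker", passwd="secret", db="docs")
c = cn.cursor()

# book a block of unclaimed documents for this worker
c.execute(
    "update documents set progress = %s where progress is null limit 1000",
    (my_worker_id,),
)
cn.commit()

Because the UPDATE only matches rows where progress is null, two workers booking at the same time should not end up claiming the same documents.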

Then you can start iterating:

c.execute("select * from documents where progress = %s", my_worker_id)
for doc in c.fetchall():
    # process doc..
    c.execute("update documents set progress = 'done' where id = %s", doc.id)
    cn.commit()

This has the advantage of being really easy to implement and really fast, and it gives you the ability to restart any worker, etc.
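Restarting a crashed worker amounts to handing its unfinished rows back to the pool; a small sketch, reusing the cn connection from above (requeue_worker is a hypothetical helper, not part of any library):

def requeue_worker(cn, worker_id):
    """Release a crashed worker's unfinished documents back to the pool."""
    c = cn.cursor()
    # rows already marked 'done' stay done; only the still-booked ones are reset
    c.execute("update documents set progress = null where progress = %s",
              (worker_id,))
    cn.commit()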

If you only have 6 workers, I'd just start them by hand.

Oh, and you can keep an eye on the progress manually with a simple SQL statement:

select progress, count(*)
from documents
group by progress
order by progress

which will show you how many documents are left (progress = null), how many are finished (progress = 'done') and how many are in each worker's queue (progress = WORKERID).

thebjorn
  • Thanks @thebjorn. But this does not answer my question, unfortunately. I would like something that can do the load balancing stuff. Or something similar to a round robin approach. – xiaohan2012 Jun 18 '14 at 11:19
  • @xiaohan2012 actually, this will do the load balancing automatically (assuming your limit is reasonable). Workers will book and process work at the speed that they can process. This approach is better (more parallelism) than a round-robin approach, since in round-robin you need to give a slice to a worker even if he is not finished with the last slice -- which can lead to an increasing backlog of slices for a slow worker. If you want something that automatically starts your processes, then try e.g. fabric (see the sketch after these comments). – thebjorn Jun 18 '14 at 14:05
  • @xiaohan2012 ... and also, a round-robin approach needs some kind of central authority, which can become a bottleneck (although probably not with only six servers...) – thebjorn Jun 18 '14 at 14:07
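A rough sketch of the fabric idea from the comment above, assuming Fabric 2.x, SSH access to the six machines, and a hypothetical worker.py script that takes the worker id as its only argument (all hostnames are placeholders):

from fabric import Connection

hosts = ["machine1", "machine2", "machine3",
         "machine4", "machine5", "machine6"]   # placeholder hostnames

for worker_id, host in enumerate(hosts, start=1):
    # launch one worker per machine; nohup keeps it alive after we disconnect
    Connection(host).run(
        "nohup python worker.py %d > worker-%d.log 2>&1 &" % (worker_id, worker_id),
        pty=False,
    )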