Background
I have 8 million independent documents in a database to process. The documents do not depend on each other, so the processing can be parallelized. After a document is processed, the result is saved back to the database.
I have 6 machines available.
Current solution
The documents are stored in a single MySQL table.
I currently split the rows into 6 equal partitions and assign one partition to each machine.
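To make the equal split concrete, here is a minimal sketch of how the ranges could be computed. It assumes the table has a contiguous integer primary key running from 1 to 8,000,000, which may not match the real schema; `partition_ranges` is a hypothetical helper name used only for illustration.

```python
def partition_ranges(total_rows, n_machines):
    """Split ids 1..total_rows into n_machines contiguous, near-equal ranges.

    Assumes ids are contiguous integers starting at 1 (an assumption,
    not something guaranteed by the actual table).
    """
    base, extra = divmod(total_rows, n_machines)
    ranges = []
    start = 1
    for i in range(n_machines):
        # The first `extra` machines take one additional row each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# One (first_id, last_id) pair per machine.
ranges = partition_ranges(8_000_000, 6)
```

Each machine would then process only the rows whose id falls inside its own range.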
Drawbacks of current solution
Some partitions may take longer to process than others, so some machines are still busy while the rest sit idle.
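The imbalance can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the per-document costs are invented, the expensive documents are assumed to be clustered in one region of the table, and both function names are hypothetical. It compares the fixed equal split against machines pulling small chunks whenever they become free.

```python
import heapq


def static_makespan(costs, n_machines):
    """Finish time when the rows are cut into equal contiguous partitions."""
    size = -(-len(costs) // n_machines)  # ceiling division
    return max(sum(costs[i:i + size]) for i in range(0, len(costs), size))


def dynamic_makespan(costs, n_machines, chunk_size):
    """Finish time when each free machine pulls the next small chunk."""
    free_at = [0] * n_machines  # min-heap of times at which machines become free
    for i in range(0, len(costs), chunk_size):
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + sum(costs[i:i + chunk_size]))
    return max(free_at)


# Invented workload: the first 1000 documents cost 5 units, the rest 1 unit.
costs = [5] * 1000 + [1] * 5000

static_time = static_makespan(costs, n_machines=6)
dynamic_time = dynamic_makespan(costs, n_machines=6, chunk_size=10)
print(static_time, dynamic_time)
```

With the fixed split, the machine that receives the expensive region determines the total running time, while pulling small chunks spreads that region across all machines.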
Requirement
- I want a method or framework that balances the load across the machines efficiently.
- I am using Python for the data processing, so ideally the tool should work well with Python.