Here is the code:
# database_extractor.py
class DatabaseExtractor(object):
    def __init__(self, ..):
        ...

    def run_extraction(self):
        # runs a SQL query and extracts the results to a file
        ...
# driver.py
from multiprocessing import Process

from database_extractor import DatabaseExtractor

def extract_func(db_extractor):
    db_extractor.run_extraction()
if __name__ == "__main__":
    db1 = DatabaseExtractor(..)
    db2 = DatabaseExtractor(..)
    db3 = DatabaseExtractor(..)
    db4 = DatabaseExtractor(..)
    db5 = DatabaseExtractor(..)
    db6 = DatabaseExtractor(..)
    db7 = DatabaseExtractor(..)
    db8 = DatabaseExtractor(..)
    worker_l = [Process(target=extract_func, args=(db1,)),
                Process(target=extract_func, args=(db2,)),
                Process(target=extract_func, args=(db3,)),
                Process(target=extract_func, args=(db4,)),
                Process(target=extract_func, args=(db5,)),
                Process(target=extract_func, args=(db6,)),
                Process(target=extract_func, args=(db7,)),
                Process(target=extract_func, args=(db8,))]
    for worker in worker_l: worker.start()
    for worker in worker_l: worker.join()
(In reality, the DatabaseExtractor instances are generated from an input config file, so there could be more than 8 processes running; a rough sketch of that setup is below.)
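Roughly, the config-driven version looks like the following. This is only a simplified sketch: the "extractions.json" file, its format, and the keyword-argument constructor are stand-ins for my actual config handling, not the real code.

# driver.py, config-driven sketch ("extractions.json" and its format are illustrative)
import json
from multiprocessing import Process

from database_extractor import DatabaseExtractor

def extract_func(db_extractor):
    db_extractor.run_extraction()

if __name__ == "__main__":
    # one entry per extraction; each entry holds the constructor arguments
    with open("extractions.json") as f:
        configs = json.load(f)

    extractors = [DatabaseExtractor(**cfg) for cfg in configs]

    worker_l = [Process(target=extract_func, args=(ext,)) for ext in extractors]
    for worker in worker_l: worker.start()
    for worker in worker_l: worker.join()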
I referred to the SO post Reference, quoting the accepted answer: "You'll either want to join your processes individually outside of your for loop (e.g., by storing them in a list and then iterating over it) or use something like numpy.Pool and apply_async with a callback". Even though I did exactly that (the processes are stored in a list, started, and only then joined in a separate loop), all my processes appear to run sequentially. I know this because four of the instances have queries that run for a couple of hours, and while one of them is running I do not see the other queries populating their respective output files. How can I force the instances to run in parallel?
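For reference, this is how I understand the Pool/apply_async suggestion from that answer (presumably multiprocessing.Pool). It is only a sketch: it reuses the illustrative config loading from above and assumes the DatabaseExtractor instances can be pickled so they can be handed to the worker processes.

# pool_driver.py, sketch of the Pool/apply_async approach from the quoted answer
import json
from multiprocessing import Pool

from database_extractor import DatabaseExtractor

def extract_func(db_extractor):
    db_extractor.run_extraction()

def on_done(_):
    # the callback runs in the parent process as each extraction completes
    print("one extraction finished")

if __name__ == "__main__":
    with open("extractions.json") as f:
        extractors = [DatabaseExtractor(**cfg) for cfg in json.load(f)]

    pool = Pool(processes=len(extractors))  # one worker per extraction
    for ext in extractors:
        pool.apply_async(extract_func, args=(ext,), callback=on_done)
    pool.close()  # no more tasks will be submitted
    pool.join()   # block until every extraction has finished

With processes=len(extractors) this should behave like the Process-based version; a smaller number would instead cap how many extractions run at once.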