There are many ways to do it - the simplest would be to just use multiprocessing.Pool and let it organize the workers for you. 10k rows is not all that much; even if an average URL is a full kilobyte long, that's still only about 10 MB of memory, and memory is cheap. So just read the file into memory and map it over a multiprocessing.Pool to do your bidding:
from multiprocessing import Pool

def downloader(param):  # our downloader worker
    # download code here
    # param will hold a line from your file (including the newline at the end, strip before use)
    # e.g. res = requests.get(param.strip())
    return True  # let's provide some response back

if __name__ == "__main__":  # important protection for cross-platform use
    with open("your_file.dat", "r") as f:  # open your file
        download_jobs = f.readlines()  # store each line in a list
    download_pool = Pool(processes=5)  # make our pool use 5 processes
    responses = download_pool.map(downloader, download_jobs)  # map our data, line by line
    download_pool.close()  # let's exit cleanly
    # you can check the response for each line in the `responses` list
You can also use threading instead of multiprocessing (or multiprocessing.pool.ThreadPool as a drop-in replacement) to do everything within a single process if you need shared memory. A single thread is more than enough for download purposes unless you're doing additional processing on top.
UPDATE
If you want your downloaders to run as class instances, you can transform the downloader function into a factory for your Downloader instances, and then just pass whatever you need to instantiate those instances alongside your URLs. Here is a simple round-robin approach:
from itertools import cycle
from multiprocessing import Pool

class Downloader(object):

    def __init__(self, port_number=8080):
        self.port_number = port_number

    def run(self, url):
        print("Downloading {} on port {}".format(url, self.port_number))

def init_downloader(params):  # our downloader initializer
    downloader = Downloader(**params[0])  # instantiate our downloader
    downloader.run(params[1])  # run our downloader
    return True  # let's provide some response back

if __name__ == "__main__":  # important protection for cross-platform use
    downloader_params = [  # Downloaders will be initialized using these params
        {"port_number": 7751},
        {"port_number": 7851},
        {"port_number": 7951}
    ]
    downloader_cycle = cycle(downloader_params)  # use cycle for round-robin distribution
    with open("your_file.dat", "r") as f:  # open your file
        # read our file line by line and attach downloader params to it
        download_jobs = [[next(downloader_cycle), row.strip()] for row in f]
    download_pool = Pool(processes=5)  # make our pool use 5 processes
    responses = download_pool.map(init_downloader, download_jobs)  # map our data
    download_pool.close()  # let's exit cleanly
    # you can check the response for each line in the `responses` list
Keep in mind that this is not the most balanced solution, as two Downloader instances can end up running on the same port at the same time, but it will average out over a large enough data set. If you want to make sure that no two Downloader instances ever run off the same port, you'll either need to build your own pool, or you'll need a central process that issues ports to your Downloader instances when they need them.
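As a rough illustration of the central-process idea (not the only way to do it), you could hand out ports through a shared Manager queue so a port is only in use by one worker at a time; the init_worker and run_on_free_port names below are just placeholders for this sketch:

from multiprocessing import Pool, Manager

def init_worker(shared_ports):  # runs once in every worker process
    global port_queue
    port_queue = shared_ports  # make the shared queue visible to this worker

def run_on_free_port(url):
    port = port_queue.get()  # grab a free port, blocks if none is available
    try:
        Downloader(port_number=port).run(url)  # reuse the Downloader class from above
        return True
    finally:
        port_queue.put(port)  # hand the port back for the next job

if __name__ == "__main__":
    manager = Manager()
    free_ports = manager.Queue()
    for port in (7751, 7851, 7951):
        free_ports.put(port)
    with open("your_file.dat", "r") as f:
        urls = [row.strip() for row in f]
    pool = Pool(processes=3, initializer=init_worker, initargs=(free_ports,))
    responses = pool.map(run_on_free_port, urls)
    pool.close()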