I am using ThreadPoolExecutor to download a large number (~400k) of keyframe images. The keyframe names are stored in a text file (let's say keyframes_list.txt).
I have modified the example provided in the documentation, and it seems to work flawlessly with one exception: the example submits every link as a future, and all of these futures are stored in an iterable (a dict(), to be precise). This iterable is passed as an argument to the as_completed() function to check when each future is completed. This of course requires a huge amount of text to be loaded in memory at once. My Python process for this task takes up 1 GB of RAM.
The full code is shown below:
import concurrent.futures

import requests


def download_keyframe(keyframe_name):
    """Download one keyframe image and save it to disk."""
    url = 'http://server/to//Keyframes/{}.jpg'.format(keyframe_name)
    r = requests.get(url, allow_redirects=True)
    r.raise_for_status()  # raise on HTTP errors instead of saving an error page
    # A context manager closes the file handle as soon as the write finishes.
    with open('path/to/be/saved/keyframes/{}.jpg'.format(keyframe_name), 'wb') as out:
        out.write(r.content)
    return True


keyframes_list_path = '/path/to/keyframes_list.txt'
future_to_url = {}

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    # Submit one future per line; every future stays in the dict until the end.
    with open(keyframes_list_path, 'r') as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            keyframe_name = fields[0]
            future_to_url[executor.submit(download_keyframe, keyframe_name)] = keyframe_name

    # Report each download as soon as its future completes.
    for future in concurrent.futures.as_completed(future_to_url):
        keyframe_name = future_to_url[future]
        try:
            future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (keyframe_name, exc))
        else:
            print('Keyframe: {} was downloaded.'.format(keyframe_name))
So, my question is: how can I provide an iterable while also keeping the memory footprint low? I have considered using a queue, but I am not sure it cooperates smoothly with ThreadPoolExecutor. Is there an easy way to control the number of futures submitted to ThreadPoolExecutor?
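
To make the goal concrete, here is a minimal sketch of one possible way to cap the number of pending futures. It uses a sliding window built on concurrent.futures.wait() with FIRST_COMPLETED rather than a queue. The helper name run_bounded and its max_pending parameter are hypothetical and are not part of the documentation example or of ThreadPoolExecutor itself.

```python
import concurrent.futures
import itertools


def run_bounded(executor, fn, items, max_pending=32):
    """Yield (item, future) pairs as futures complete.

    Hypothetical helper: at most max_pending futures exist at any time,
    so the input iterable is consumed lazily instead of all at once.
    """
    items = iter(items)
    pending = {}  # future -> item, never holds more than max_pending entries

    # Fill the window with the first max_pending items.
    for item in itertools.islice(items, max_pending):
        pending[executor.submit(fn, item)] = item

    while pending:
        # Block until at least one of the pending futures has finished.
        done, _ = concurrent.futures.wait(
            pending, return_when=concurrent.futures.FIRST_COMPLETED
        )
        for future in done:
            yield pending.pop(future), future
        # Submit one new item for every future that just finished.
        for item in itertools.islice(items, len(done)):
            pending[executor.submit(fn, item)] = item
```

With the code from the question, this could be called as run_bounded(executor, download_keyframe, names), where names is a generator that reads keyframes_list.txt line by line, so only max_pending names and futures would be held in memory at a time.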