I built myself a scraper. Since I have multiple targets on the same page, I wanted to create a list containing all the URLs that should be scraped. Scraping takes some time, and the URLs need to be scraped concurrently. Because I do not want to maintain x scripts for x URLs, I thought of multiprocessing and spawning a worker for each URL in the list. After some searching on DuckDuckGo and reading, for example, https://keyboardinterrupt.org/multithreading-in-python-2-7/ and "When should we call multiprocessing.Pool.join?", I came up with the code provided at the bottom.
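To make the intent concrete, the pattern I am aiming for is roughly the sketch below, with stand-in scrape() and def_list() functions (the real ones download/parse the pages and build the URL list, and are not relevant here):

# Minimal sketch of the intended setup: one pool of workers, one
# scrape() call per URL. scrape() and def_list() are placeholders.
from multiprocessing.pool import ThreadPool

def def_list():
    # placeholder: the real function builds the list of target URLs
    return ['http://example.com/a', 'http://example.com/b']

def scrape(url):
    # placeholder: the real function downloads and parses the page
    print 'scraping %s' % url

if __name__ == '__main__':
    urls = def_list()
    pool = ThreadPool(processes=10)  # scrape up to 10 URLs concurrently
    pool.map(scrape, urls)           # one worker call per URL
    pool.close()
    pool.join()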
Executed from a cmd line, the code runs the main loop but never enters the scrape() function (inside it are some print messages that are never output). No error message is given and the script exits normally.
What am I missing?
I am using Python 2.7 on 64-bit Windows.
I already read:
Threading pool similar to the multiprocessing Pool?
https://docs.python.org/2/library/threading.html
https://keyboardinterrupt.org/multithreading-in-python-2-7/
but it didn't help.
def main():
    try:
        from multiprocessing import process
        from multiprocessing.pool import ThreadPool
        from multiprocessing import pool

        thread_count = 10  # Define the limit of concurrent running threads
        thread_pool = ThreadPool(processes=thread_count)  # Define the thread pool to keep track of the sub processes
        known_threads = {}
        list = []
        list = def_list()  # Just assigns the URLs to the list
        for entry in range(len(list)):
            print 'starting to scrape'
            print list[entry]
            known_threads[entry] = thread_pool.apply_async(scrape, args=(list[entry]))
        thread_pool.close()  # After all threads started we close the pool
        thread_pool.join()   # And wait until all threads are done
    except Exception, err:
        print Exception, err, 'Failed in main loop'
        pass