
I built myself a scraper. Since I have multiple targets on the same page, I wanted to create a list which contains all URLs that should then get scraped. The scraping takes some time and I need to scrape them at the same time. Because I do not want to maintain x scripts for x URLs, I thought of multiprocessing and spawning a process for each URL in the list. After some DuckDuckGo searching and some reading, for example here: https://keyboardinterrupt.org/multithreading-in-python-2-7/ and here: When should we call multiprocessing.Pool.join?, I came up with the code provided. Executed from a cmd line, the code runs the main loop but never enters the scrape() function (inside it are some print messages which are not output). No error message is given and the script exits normally. What am I missing?
I am using Python 2.7 on Windows x64.
I already read:
Threading pool similar to the multiprocessing Pool?
https://docs.python.org/2/library/threading.html
https://keyboardinterrupt.org/multithreading-in-python-2-7/
but it didn't help.

def main():
    try:
        from multiprocessing.pool import ThreadPool
        thread_count = 10  # Limit of concurrently running threads
        thread_pool = ThreadPool(processes=thread_count)  # The pool that keeps track of the sub-tasks
        known_threads = {}
        list = def_list()  # Just assigns the URLs to the list
        for entry in range(len(list)):
            print 'starting to scrape'
            print list[entry]
            known_threads[entry] = thread_pool.apply_async(scrape, args=(list[entry]))
        thread_pool.close()  # After all tasks are submitted we close the pool
        thread_pool.join()  # And wait until all of them are done
    except Exception, err:
        print Exception, err, 'Failed in main loop'
  • Why would you close the pool before joining the threads? – Daniel Roseman Jan 22 '19 at 23:41
  • When joining the threads before closing them, it throws an exception: "Failed in main loop". It is also done in this sequence in the example under https://keyboardinterrupt.org/multithreading-in-python-2-7/, where I got most of this code. – J.Doe Jan 22 '19 at 23:50
  • Why are you silencing every error? Don't do that and debug what is actually going on. – juanpa.arrivillaga Jan 23 '19 at 00:01
  • I am not aware of silencing an error. Can you clarify how I am silencing the error? – J.Doe Jan 23 '19 at 00:03
  • 1
  • Wrapping your whole code in `try: except Exception`. You said "when joining the threads before closing them it throws an exception", and then you fixed it, essentially, by not executing anything. Instead, you should *fix the actual exception you weren't expecting*. – juanpa.arrivillaga Jan 23 '19 at 00:33
  • Does that mean wrapping it in try: except does not catch all errors? If so, what is the right way to catch exceptions? The exception occurs when I mess with the sequence from the 'tutorial'. So naturally, not knowing much about this, I have to assume that the order from the 'tutorial', which throws no error, is the right one. Although waiting for the threads to end and then closing them sounds sane to me, it didn't work with that simple change either. How would you start to debug this? – J.Doe Jan 23 '19 at 07:26
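Picking up the debugging advice from the comments above: apply_async() returns an AsyncResult object, and an exception raised inside a worker is stored on that object and only re-raised when .get() is called on it. If the results are never collected, the failure stays invisible, which matches the symptom described in the question. Below is a minimal sketch of that pattern, with a hypothetical scrape() and placeholder URLs standing in for the real ones:

    from multiprocessing.pool import ThreadPool

    def scrape(url):  # hypothetical stand-in for the real scrape()
        print 'scraping', url
        return url

    def main():
        urls = ['http://example.com/a', 'http://example.com/b']  # placeholder URLs
        thread_pool = ThreadPool(processes=10)
        results = {}
        for entry, url in enumerate(urls):
            # args must be a tuple; note the trailing comma. args=(url) is just
            # the string itself, which apply_async unpacks character by character.
            results[entry] = thread_pool.apply_async(scrape, args=(url,))
        thread_pool.close()  # Must come before join(); join() on an open pool raises an AssertionError
        thread_pool.join()  # Wait until all workers are done
        for entry in results:
            # get() re-raises any exception that happened inside the worker;
            # without this call the error never surfaces.
            print results[entry].get()

    if __name__ == '__main__':
        main()

Collecting the results this way would also surface the problem in the posted loop: args=(list[entry]) is not a one-element tuple but a parenthesized string, so (assuming scrape() takes a single URL argument) each worker raises a TypeError that is silently stored in its AsyncResult instead of being printed.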

0 Answers