I am having issues with my code when trying to run tasks in parallel using Python's multiprocessing library.

Here is my code. I have a function called extract_tag_data:

def extract_tag_data(tag):
    search_bar.send_keys(tag)
    search_bar.send_keys(Keys.RETURN)

    for i in range (2):
        articles=driver.find_elements(By.XPATH, "//table[@class='table table-hover']/tbody/tr/td[2]/div[@class='media']/div[@class='media-body']/strong/a")
        for article in articles[:1]:
            article.click()
            dict['tag']=tag
            dict['article_title'].append(unidecode.unidecode(driver.find_element(By.XPATH,'//h1[@class="title"]').text))
            dict['abstract'].append(unidecode.unidecode(driver.find_element(By.XPATH,'//div[@class="abstract"]/div[1]').text))
            dict['authors'].append(unidecode.unidecode(",".join([element.text for element in (driver.find_elements(By.XPATH,'//div[@class="authors"]/span'))])))
            dict['structs'].append(unidecode.unidecode(",".join([element.text for element in (driver.find_elements(By.XPATH,'//div[@class="authors"]/div[@class="structs"]/div[@class="struct"]/a'))])))
            driver.back()
        driver.find_element(By.XPATH,'//table[@class="table table-hover"]/tfoot/tr[1]/th[2]/ul/li/a/span[@class="glyphicon glyphicon-step-forward"]').click()

and I want to run this task on a list of tags in parallel:

if __name__ == '__main__':
    with multiprocessing.get_context('spawn').Pool(3) as pool:
        pool.map(extract_tag_data, (tags))
        pool.close()
        driver.quit()
        df = pd.DataFrame(dict,columns=['article_title',  'authors',  'abstract','structs','tag'])
        df.to_excel(r"C:\\Users\\dell\\Desktop\\data collection\\myDataset.xlsx",  sheet_name='Sheet1')
        driver.quit()

but I am getting the following error:

File "C:\Users\dell\miniconda3\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

[Done] exited with code=1 in 77.947 seconds

Booboo
abdo berg
  • Can you post the minimum example that can reproduce the error? This is most likely due to some code that's outside of the `if __name__ == '__main__':` block as the error suggests. – Tim Aug 31 '22 at 18:29
  • I am not sure which part reproduces the error, but this is almost all my code; the rest is just the initialisation of the webdriver and the other variables used. – abdo berg Aug 31 '22 at 18:53
  • It's the initialization of the webdriver that is causing the problem. I assume you are using `selenium`, and since `selenium` runs in its own process, you only need a multithreading pool in which each thread initializes its own `selenium` instance. Ideally, the thread reuses this webdriver for all the submitted tasks it processes. See [this post](https://stackoverflow.com/questions/53475578/python-selenium-multiprocessing) and my answer, which ensures that the drivers are properly terminated. – Booboo Sep 02 '22 at 11:21

1 Answer

Driver starts child process when pool process is created

A bit of a shot in the dark. I'm guessing that the driver starts its own subprocess when the module is loaded. This tricks the pool sub-process into thinking you have set up your multiprocessing code incorrectly. You should initialize the driver under the `if __name__ == '__main__':` clause and pass the driver as an argument to the pool process.
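
To illustrate the module layout this suggestion is aiming for: with the `spawn` start method each child process re-imports the main module, so the top level should contain only imports and definitions, and anything with side effects belongs under the guard. The following is a minimal sketch under that assumption. `start_browser` and `clean_tag` are hypothetical placeholders, not Selenium calls, and the sketch does not attempt to pass a driver to the workers.

```python
import multiprocessing


def start_browser():
    """Hypothetical placeholder for expensive setup such as webdriver.Chrome()."""
    return {"name": "placeholder-browser"}


def clean_tag(tag):
    """Work done in a child process; it relies only on its argument."""
    return tag.strip().lower()


def main():
    # Setup with side effects happens here, never at import time,
    # so children that re-import this module do not repeat it.
    browser = start_browser()
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(3) as pool:
        cleaned = pool.map(clean_tag, [" Python ", "Selenium", " PANDAS"])
    print(browser["name"], cleaned)


if __name__ == "__main__":
    main()
```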

SargeATM
  • OP's goal may be to use multiple driver instances to do the processing simultaneously. In that case, it may be better to initialize the driver with the `initializer` option. Otherwise, the different actions each process takes may conflict over the shared driver. – Tim Aug 31 '22 at 21:15
  • Thanks, the one-driver-per-process idea seems interesting, but it doesn't work. I tried to instantiate the driver inside the function extract_tag_data, but that doesn't work. I also tried initializing the driver inside the `if __name__ == '__main__':` clause, but that doesn't solve the problem either. – abdo berg Aug 31 '22 at 22:02
  • If you comment out the driver code, do you still get a `RuntimeError`? – SargeATM Aug 31 '22 at 22:22
  • Yes, I tried a simple function that just prints, and I got the same error. – abdo berg Sep 01 '22 at 21:09
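
For reference, the `initializer` option mentioned in the comments above can be sketched as follows. This is only an illustration under stated assumptions: `FakeDriver` is a hypothetical stand-in for a Selenium driver, and `init_worker` is an invented name. Nothing runs at module level except definitions, which is the condition the `RuntimeError` message asks for.

```python
import multiprocessing
import os

_driver = None  # set separately inside each worker process


class FakeDriver:
    """Hypothetical stand-in for a Selenium webdriver (assumption)."""

    def title_for(self, tag):
        return f"article about {tag}"


def init_worker():
    """Runs once in every worker process when the pool starts it."""
    global _driver
    _driver = FakeDriver()


def extract_tag_data(tag):
    # Uses the driver owned by the current process; nothing is shared.
    return {
        "tag": tag,
        "article_title": _driver.title_for(tag),
        "pid": os.getpid(),
    }


def main(tags):
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(3, initializer=init_worker) as pool:
        return pool.map(extract_tag_data, tags)


if __name__ == "__main__":
    for row in main(["python", "selenium", "pandas"]):
        print(row)
```

The workers return their rows through `pool.map` because each process has its own copy of module-level globals, so appending to a global dictionary in a worker would not change the parent's copy. Real drivers created this way would also need to be quit when the work is done, which this sketch leaves out.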