
New to multiprocessing! Please help.

All libraries are imported and the `get_links` method works; I've tested it on a single case. I'm trying to run the method for multiple URLs, distributed across parallel processes, to make it faster. Without multiprocessing my runtimes are 10+ hours.

Edit 2:

Tried my best at an MCVE:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
from multiprocessing import Pool

options = Options()
options.headless = True
options.binary_location = 'C:\\Users\\Liam\\AppData\\Local\\Google\\Chrome SxS\\Application\\Chrome.exe'
options.add_argument('--blink-settings=imagesEnabled=false')
options.add_argument('--no-sandbox')
options.add_argument("--proxy-server='direct://'")
options.add_argument("--proxy-bypass-list=*")

subsubarea_urls = []
with open('subsubarea_urls.txt') as f:
    for item in f:
        item = item.strip()
        subsubarea_urls.append(item)

test_urls = subsubarea_urls[:3] 

def get_links(url):
    # Each worker starts its own browser so the processes stay independent.
    driver = webdriver.Chrome('....\Chromedriver', chrome_options=options)
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    link = soup.find(class_='listings__all')
    if link is not None:
        link = "example.com" + link.find('a')['href']
    driver.quit()  # quit() also ends the chromedriver process; close() only closes the window
    return link

def main():
    how_many = 3
    p = Pool(processes=how_many)
    data = p.map(get_links, test_urls)
    p.close()
    p.join()  # wait for the workers to exit cleanly

    with open('test_urls.txt', 'w') as f:
        f.write(str(data))

if __name__ == '__main__':
    main()
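The Selenium parts need a live browser, so here is a minimal sketch of the same `Pool.map` pattern with a pure-Python stand-in for the scraper (`fake_get_links` is a hypothetical placeholder invented for illustration, not part of the original code):

```python
from multiprocessing import Pool

def fake_get_links(url):
    # Hypothetical stand-in for the Selenium-based get_links above:
    # pretend each page yields one "all listings" link.
    return "example.com/" + url.rsplit("/", 1)[-1]

def run_pool(urls, how_many=3):
    # Same Pool.map pattern as main(), minus the browser work.
    # The context manager tears the pool down after map() has
    # already collected all results.
    with Pool(processes=how_many) as p:
        return p.map(fake_get_links, urls)

if __name__ == "__main__":
    print(run_pool(["http://site/a", "http://site/b", "http://site/c"]))
```

The `if __name__ == '__main__':` guard matters on Windows, where `Pool` workers are spawned by re-importing the main module; without it the module-level code would run again in every child.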
liamod
  • Why do you use `[link for link in test_urls]` ?! you can just use `test_urls` – Amir Mar 13 '19 at 17:13
  • What is `get_links`? provide the code please – Amir Mar 13 '19 at 17:13
  • Perhaps something is messing up in get_links? Everything else seems fine, although I am unsure if you need to use `p.close()` – Zach Mar 13 '19 at 17:14
  • Yes I changed it to test_urls shortly after posting, provided the get_links code. – liamod Mar 13 '19 at 17:27
  • I'm sure the links are fine; I'm using test links to get this part of the code working, they are ideal – liamod Mar 13 '19 at 17:28
  • You need to post an [MCVE](https://stackoverflow.com/help/mcve). Especially with multiprocessing, small mistakes can be missed. In the code above, you're referencing variables which have not been assigned (test_urls). This makes it impossible for us to debug. – calico_ Mar 13 '19 at 17:31
  • Added an attempt at an MCVE – liamod Mar 13 '19 at 17:44
  • Running on Windows 10 64-bit, in a Jupyter notebook / Spyder – liamod Mar 13 '19 at 18:18

1 Answer


Unexpectedly, the problem had nothing to do with the code. Multiprocessing in Python does not seem to play well with Windows GUI shells: the subprocesses spawned by `Pool` don't have std streams. The code needs to be executed in IDLE; run `python -m idlelib.idle` to open IDLE.

See Terry Jan Reedy's answer here
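For anyone hitting the same symptom, the missing-streams condition can be checked directly. `streams_available` below is a hypothetical helper, not part of the original code: in a console-launched interpreter both streams exist, while some Windows GUI launchers (e.g. `pythonw.exe`) leave them as `None`:

```python
import sys

def streams_available():
    # Pool workers inherit the parent's (lack of) console; if stdout or
    # stderr is None, print() calls inside workers and some of
    # multiprocessing's own error reporting can fail.
    return sys.stdout is not None and sys.stderr is not None

if __name__ == "__main__":
    print(streams_available())  # True when run from a normal console
```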

liamod