import lxml.html
import mechanize, cookielib
import multiprocessing

browser = None

def download(i):
    link = 'http://www.google.com'
    response = browser.open(link)
    tree = lxml.html.parse(response)
    print tree
    return 0

if __name__ == '__main__':    
    browser = mechanize.Browser()
    cookie_jar = cookielib.LWPCookieJar()
    browser.set_cookiejar(cookie_jar)
    browser.set_handle_equiv(True)
    browser.set_handle_gzip(True)
    browser.set_handle_redirect(True)
    browser.set_handle_referer(False)  # initially this was on, but it's probably better off
    browser.set_handle_robots(False)
    browser.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    browser.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:2.0.1) Gecko/20100101 Ubuntu/11.04 maverick Firefox/4.0.1')]

    pool = multiprocessing.Pool(None)
    tasks = range(8)
    r = pool.map_async(download, tasks)
    r.wait() # Wait on the results

If I remove the multiprocessing part, it works. If I don't use the browser inside the download function, it also works. But the combination of multiprocessing and mechanize simply does not work.

How can I fix this? The problem doesn't happen under Linux.

user975982
2 Answers


I would try to:

  • remove browser = None, or
  • move the code in the if __name__ == "__main__" block into a main() function and add global browser before browser = mechanize.Browser() (a sketch of this follows the list), or
  • move the code that initializes browser into a Pool initializer
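
A minimal sketch of the second suggestion, with the set_handle_* configuration from the question elided; the global statement rebinds the module-level browser name rather than creating a local inside main():

import mechanize
import multiprocessing

def download(i):
    response = browser.open('http://www.google.com')
    return len(response.read())

def main():
    global browser                      # rebind the module-level name
    browser = mechanize.Browser()
    # ... same set_handle_* / addheaders configuration as in the question ...
    pool = multiprocessing.Pool(None)
    r = pool.map_async(download, range(8))
    r.wait()

if __name__ == '__main__':
    main()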

If your tasks are I/O-bound then you don't necessarily need multiprocessing to make concurrent requests. For example, you could use concurrent.futures.ThreadPoolExecutor, gevent, or Twisted instead.
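
As an illustration, here's a minimal thread-based sketch. It assumes the futures backport of concurrent.futures is installed on Python 2 (pip install futures); since a single mechanize.Browser is not thread-safe, each thread lazily creates its own:

import threading
import mechanize
from concurrent.futures import ThreadPoolExecutor  # `futures` backport on Python 2

thread_local = threading.local()

def download(i):
    # one Browser per thread: a shared mechanize.Browser is not thread-safe
    if not hasattr(thread_local, 'browser'):
        thread_local.browser = mechanize.Browser()
        thread_local.browser.set_handle_robots(False)
    response = thread_local.browser.open('http://www.google.com')
    return len(response.read())

if __name__ == '__main__':
    pool = ThreadPoolExecutor(max_workers=4)
    print(list(pool.map(download, range(8))))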

Related: Problem with multi threaded Python app and socket connections

jfs

Only the main process executes the if __name__ == '__main__' block. Since Windows lacks a fork system call, each worker process re-imports the module and never runs that block, so each process in the pool needs to create its own browser. You can do this with an initializer function. For reference, see the initializer and initargs options of multiprocessing.Pool.

import lxml.html
import mechanize, cookielib
import multiprocessing as mp

def download(i):
    # browser is the per-process global created by init() below
    link = 'http://www.google.com'
    response = browser.open(link)
    tree = lxml.html.parse(response)
    print tree
    return 0

def init(count):
    global browser
    browser = mechanize.Browser()
    cookie_jar = cookielib.LWPCookieJar()
    browser.set_cookiejar(cookie_jar)
    browser.set_handle_equiv(True)
    browser.set_handle_gzip(True)  # experimental in mechanize; emits a warning
    browser.set_handle_redirect(True)
    browser.set_handle_referer(False)
    browser.set_handle_robots(False)
    browser.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), 
                               max_time=1)
    browser.addheaders = [('User-agent', 
        'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:2.0.1) '
        'Gecko/20100101 Ubuntu/11.04 maverick Firefox/4.0.1')]

    # decrement under the Value's lock; -= alone is a non-atomic read-modify-write
    with count.get_lock():
        count.value -= 1

if __name__ == '__main__':
    import time
    count = mp.Value('I', mp.cpu_count())
    pool = mp.Pool(count.value, initializer=init, initargs=(count,))
    #wait until all processes are initialized
    while count.value > 0:
        time.sleep(0.1)

    tasks = range(8)
    r = pool.map_async(download, tasks)
    r.wait()
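
Since each worker runs its initializer before it takes any task from the queue, the countdown above is only needed to block until every worker has finished initializing; if that isn't required, a simpler variant (with the count parameter dropped from init) would be:

if __name__ == '__main__':
    pool = mp.Pool(initializer=init)  # init takes no arguments in this variant
    tasks = range(8)
    r = pool.map_async(download, tasks)
    r.wait()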
Eryk Sun