
The script I've written works on OS X and Linux. It uses selenium to scrape data from some pages, manipulates the data and saves it. To be more efficient, I use a multiprocessing Pool and Manager. I create a pool and, for each item in a list, it calls the scrape class, which starts a PhantomJS instance and scrapes. Since I'm using `multiprocessing.Pool` and wanted a way to pass data between the worker processes, I read that `multiprocessing.Manager` was the way forward. If I wrote `manager = Manager()` and `info = manager.dict([])`, it would create a dict that could be accessed by all the workers. It all worked perfectly.

My issue is that the client wants to run this on a Windows machine (I wrote the entire thing on OS X). I assumed it would be as simple as installing Python and selenium and launching it. I got errors which later led me to write `if __name__ == '__main__':` at the top of my main.py file and indent everything to be inside it. The problem is that when I have `class scrape():` outside of the if statement, it cannot see the global `info`, since `info` is now declared inside the if block. If I put `class scrape():` inside the `if __name__ == '__main__':` block, then I get an attribute error saying

AttributeError: 'module' object has no attribute 'scrape'

And if I go back to declaring `manager = Manager()` and `info = manager.dict([])` outside of the `if __name__ == '__main__':` block, then I get the error on Windows about making sure I use `if __name__ == '__main__':`. It doesn't seem like I can win with this project at the moment.

Code Layout...

# Imports...
import datetime
from multiprocessing import Pool
from multiprocessing import Manager

manager = Manager()
info = manager.dict([])
date = str(datetime.date.today())

class do_scrape():
    def __init__(self, item):
        ...
    # def...

def scrape_items():
    # This contains code which creates a pool and then calls
    # pool.map(do_scrape, s), where s is a list of items
    ...

def save_scrape():
    ...

def update_price():
    ...

def main():
    ...

main()

Basically, `scrape_items` is called by `main`, then `scrape_items` uses `pool.map(do_scrape, s)`, so it calls the `do_scrape` class and passes the list of items to it one by one. `do_scrape` then scrapes a web page based on the item URL in `s` and saves that info in the global `info`, which is the Manager dict. The above code does not show any `if __name__ == '__main__':` statements; it is an outline of how it works on my OS X setup, where it runs and completes the task as is. If someone could offer a few pointers, I would appreciate it. Thanks

tshepang
Adam.J
  • Add a skeleton of your code showing the position of the manager, the scrape class, the global variables, and the line where you use `if __name__ == '__main__':` – hemraj Jul 04 '14 at 15:43
  • You should revisit your reasoning for adding `if __name__ == '__main__':`. I can't see a reason this would only be needed on Windows, and most likely there's a better way. – Martin Konecny Jul 04 '14 at 15:58
  • @MartinKonecny When I ran the above code on a Windows machine, the error I was getting was "Attempt to start a new process before the current process has finished its bootstrapping phase. This probably means that you are on Windows and you have forgotten to use the proper idiom in the main module: if __name__ == '__main__': freeze_support() ..." – Adam.J Jul 04 '14 at 16:14
  • @MartinKonecny The `if __name__` guard is absolutely required on Windows in places where it's not on other platforms. This is explained in [the docs](https://docs.python.org/2/library/multiprocessing.html#windows) – dano Jul 04 '14 at 16:19

2 Answers


It would be helpful to see your code, but it sounds like you just need to explicitly pass your shared dict to scrape, like this:

import multiprocessing
from functools import partial

def scrape(info, item):
    # Use info in here
    pass

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    info = manager.dict()
    pool = multiprocessing.Pool()
    func = partial(scrape, info)  # use a partial to make it easy to pass the dict to pool.map
    items = [1, 2, 3, 4, 5]  # This would be your actual data
    results = pool.map(func, items)
    # pool.apply_async(scrape, [info, "abc"])  # In case you're not using map...

Note that you shouldn't put all your code inside the `if __name__ == "__main__":` guard, just the code that actually creates processes via multiprocessing. This includes creating the Manager and the Pool.

Any function you want to run in a child process must be declared at the top level of the module, because it has to be importable from `__main__` in the child process. When you declared `scrape` inside the `if __name__ ...` guard, it could no longer be imported from the `__main__` module, so you saw the `AttributeError: 'module' object has no attribute 'scrape'` error.
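To see why the function has to live at the top level, it may help to look at how a function gets sent to a worker: pickle stores only a reference to it (module name plus function name), not its code, and the worker resolves that reference by name. The sketch below only illustrates that mechanism; `scrape`, `make_nested` and `can_pickle` are hypothetical names, not part of your code. A function defined under the guard fails slightly differently (the reference pickles fine in the parent, but the name is missing when the Windows child re-imports the module), though the root cause is the same lookup by name.

```python
import pickle


def scrape(item):
    """Top-level function: pickle can refer to it as <module>.scrape."""
    return item.upper()


def make_nested():
    """Return a function that only exists inside this call."""
    def nested(item):
        return item.upper()
    return nested


def can_pickle(obj):
    """Report whether pickle can serialize obj by reference."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        # The exact exception type varies between Python versions.
        return False


if __name__ == "__main__":
    print("top-level function picklable:", can_pickle(scrape))
    print("nested function picklable:", can_pickle(make_nested()))
```

When run as a script, this should report that the top-level function can be pickled and the nested one cannot, which is the same constraint `pool.map` runs into.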

Edit:

Taking your example:

import datetime
import multiprocessing
from functools import partial

date = str(datetime.date.today())

#class do_scrape():
#    def __init__():
#    def...
def do_scrape(info, s):
    # do stuff
    # Also note that do_scrape should probably be a function, not a class
    pass

def scrape_items():
    # scrape_items is called by main(), which is protected by an `if __name__ ...` guard,
    # so this is ok.
    manager = multiprocessing.Manager()
    info = manager.dict([])
    pool = multiprocessing.Pool()
    func = partial(do_scrape, info)
    s = [1, 2, 3, 4, 5]  # Substitute with the real s
    results = pool.map(func, s)

def save_scrape():
    pass

def update_price():
    pass

def main():
    scrape_items()

if __name__ == "__main__":
    # Note that you can declare manager and info here, instead of in scrape_items, if you wanted
    #manager = multiprocessing.Manager()
    #info = manager.dict([])
    main()

One other important note here is that the first argument to `map` should be a function, not a class. This is stated in the docs (`multiprocessing.Pool.map` is meant to be equivalent to the built-in `map`).
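If the scraping logic really does need to be a class, one way to keep `map` happy is a small top-level wrapper function that builds the instance inside the worker and returns only plain data. This is a rough sketch under assumptions: the `Scrape` class, its `fetch_price` method and the example URLs are made-up stand-ins for your real selenium code.

```python
import multiprocessing
from functools import partial


class Scrape(object):
    """Hypothetical stand-in for the real scraping class."""

    def __init__(self, url):
        self.url = url

    def fetch_price(self):
        # Placeholder: the real method would drive selenium/PhantomJS.
        return len(self.url)


def do_scrape(info, url):
    """Top-level wrapper passed to pool.map: one call per item."""
    scraper = Scrape(url)              # the instance lives only in the worker
    info[url] = scraper.fetch_price()  # store the result in the shared dict
    return url                         # return plain data, not the instance


def scrape_items(urls):
    manager = multiprocessing.Manager()
    info = manager.dict()
    pool = multiprocessing.Pool(processes=2)
    try:
        pool.map(partial(do_scrape, info), urls)
    finally:
        pool.close()
        pool.join()
    result = info.copy()  # copy into a plain dict before the manager stops
    manager.shutdown()
    return result


if __name__ == "__main__":
    print(scrape_items(["http://example.com/a", "http://example.com/bb"]))
```

Because `do_scrape` returns a string rather than the `Scrape` instance, nothing unpicklable has to travel back to the parent process.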

dano
  • Thanks for this, I am however new to python and don't fully understand the solution. I tried the partial function from functools, but no joy. – Adam.J Jul 04 '14 at 16:16
  • @user3387507 Can you be more specific about what "no joy" means? – dano Jul 04 '14 at 16:20
  • I can't since I don't fully understand the solution. I'm clearly doing something wrong. – Adam.J Jul 04 '14 at 16:27
  • @user3387507 I updated my answer to look more like your example code. Maybe that will help. – dano Jul 04 '14 at 16:31
  • Thanks for the edited code. "do_scrape" should be scrape; I changed it because I read that it might be a solution if I ever had a file called scrape.py in the directory. The scrape class has different functions for scraping different things, etc. That's why it's a class. This solution seems the closest; however, I get a pickling error now: "multiprocessing.pool.MaybeEncodingError: Error sending result '.scrape instance' Reason UnpickleableError" – Adam.J Jul 04 '14 at 21:14
  • I think you're getting that because you're trying to call `map` on a class instead of a function. The class apparently isn't pickleable in its current form. When you call `map` on the class, an instance of the class gets returned, which `map` will try to return to the parent process by pickling it. Since the class isn't pickleable, you get the error you see. You should really be calling `map` on a function, even if all that function does is instantiate an instance of `scrape`. – dano Jul 04 '14 at 22:46
  • Thanks so much! The solution above worked perfectly, after I created a new function do_scrape() which instantiated an instance of scrape(). Works like a charm. Thanks! – Adam.J Jul 05 '14 at 07:40

Find the starting point of your program, and make sure you wrap only that with your if statement. For example:

# Imports...
import datetime
from multiprocessing import Pool
from multiprocessing import Manager

manager = Manager()
info = manager.dict([])
date = str(datetime.date.today())

class do_scrape():
    def __init__(self, item):
        ...
    # def...

def scrape_items():
    # This contains code which creates a pool and then calls
    # pool.map(do_scrape, s), where s is a list of items
    ...

def save_scrape():
    ...

def update_price():
    ...

def main():
    ...

if __name__ == "__main__":
    main()

Essentially, the contents of the `if` block are only executed if you run this file directly. If this file/module is imported from another file, everything at the top level is still defined, so you can access its attributes without actually beginning execution of the module.
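The effect of the guard can be checked without starting any processes. The sketch below writes a tiny module to a temporary file and loads it twice with `runpy.run_path`: once as `__main__`, and once under another name, which is roughly what happens when a child process on Windows re-imports the main module (as far as I know, Python 3's spawn code uses the name `__mp_main__`). The module source and the `load_as` helper are made up for illustration.

```python
import os
import runpy
import tempfile
import textwrap

# Hypothetical module body: one top-level function and one guarded statement.
SOURCE = textwrap.dedent('''
    started = False

    def scrape(item):
        return item

    if __name__ == "__main__":
        started = True
''')


def load_as(run_name):
    """Execute SOURCE from a temp file with __name__ set to run_name."""
    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(SOURCE)
        return runpy.run_path(path, run_name=run_name)
    finally:
        os.remove(path)


if __name__ == "__main__":
    for name in ("__main__", "__mp_main__"):
        module_globals = load_as(name)
        print(name, "-> guard ran:", module_globals["started"],
              "| scrape defined:", "scrape" in module_globals)
```

In both cases `scrape` is defined, but the guarded block only runs under `__main__`, so anything that starts processes belongs inside the guard and anything the workers need belongs outside it.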

Read more here: What does if __name__ == "__main__": do?

Community
Martin Konecny
  • I tried this solution and I still get the same error about needing if __name__ == '__main__': – Adam.J Jul 04 '14 at 16:23
  • Make sure you wrap the code that starts your multi process with the `if __name__ == '__main__'` block – Martin Konecny Jul 04 '14 at 16:24
  • Just to clarify: if you wrap your entire code block with `if __name__ == '__main__'`, then the file is essentially empty/useless if imported from another module (the `if` statement will evaluate to false). If you use the `if` statement only to prevent automatic execution, all the attributes/functions will be visible, and you will avoid the problem where execution of your module begins before you want it to. – Martin Konecny Jul 04 '14 at 16:39