The program/script I've made works on OS X and Linux. It uses Selenium to scrape data from some pages, manipulates the data and saves it. To make it more efficient, I brought in a multiprocessing Pool and Manager: I create a pool and, for each item in a list, it calls the scrape class, which starts a PhantomJS instance and scrapes the page. Since I'm using multiprocessing.Pool and want a way to pass data between the worker processes, I read that multiprocessing.Manager was the way forward. If I wrote
    manager = Manager()
    info = manager.dict([])
it would create a dict that all of the worker processes could read and write. It all worked perfectly on OS X and Linux.
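Stripped of the Selenium/PhantomJS details, the pattern that works for me on OS X is roughly the following (a minimal sketch with made-up names; in my real code do_scrape is a class):

    from multiprocessing import Pool, Manager

    manager = Manager()
    info = manager.dict([])             # shared dict, visible to every worker

    def do_scrape(item):
        # start phantomjs, scrape the page for `item`, massage the data...
        info[item] = 'scraped data'     # each worker writes its result here

    def scrape_items(s):
        pool = Pool(4)
        pool.map(do_scrape, s)          # one do_scrape call per item in the list
        pool.close()
        pool.join()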
My issue is that the client wants to run this on a Windows machine (I wrote the entire thing on OS X). I assumed it would be as simple as installing Python and Selenium and launching the script, but I hit errors that eventually led me to add

    if __name__ == '__main__':

at the top of my main.py file and indent everything to sit inside it. The problem is that when class scrape(): is defined outside of that if statement, it cannot see the global info, since info is now declared inside the guard and (as far as I can tell) out of its scope. If I move class scrape(): inside the if __name__ == '__main__': block, then I get an attribute error:

    AttributeError: 'module' object has no attribute 'scrape'

And if I go back to declaring manager = Manager() and info = manager.dict([]) outside of the if __name__ == '__main__': block, then I get the Windows error about making sure I use if __name__ == '__main__'. It doesn't seem like I can win with this project at the moment.
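To make the failure concrete, the arrangement that produces the AttributeError looks roughly like this (heavily simplified, names made up): the class only exists inside the guard, so when Windows spawns a worker and re-imports main.py, the name isn't there.

    from multiprocessing import Pool, Manager

    if __name__ == '__main__':
        manager = Manager()
        info = manager.dict([])

        class scrape():                         # only defined when the guard is True
            def __init__(self, item):
                # scraping happens in here in my real code
                info[item] = 'scraped data'

        pool = Pool(4)
        pool.map(scrape, ['item1', 'item2'])    # the child re-imports main.py with the
                                                # guard False, so 'scrape' is missing
                                                # -> AttributeError in the worker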
Code Layout...

    # Imports...
    import datetime
    from multiprocessing import Pool
    from multiprocessing import Manager

    manager = Manager()
    info = manager.dict([])    # the shared dict the workers write into
    date = str(datetime.date.today())

    class do_scrape():
        def __init__(self, item):
            # starts a phantomjs instance and scrapes the page for this item
            ...
        # def ... (more methods)

    def scrape_items():
        # creates a pool, then pool.map(do_scrape, s) where s is a list of items
        ...

    def save_scrape(): ...
    def update_price(): ...

    def main(): ...

    main()
Basically, scrape_items is called by main, and scrape_items uses pool.map(do_scrape, s), so the do_scrape class is called once for each item in the list. do_scrape then scrapes a web page based on the item URL in "s" and saves the result in the global info, which is the multiprocessing Manager dict. The layout above does not show any if __name__ == '__main__': statements; it is an outline of how things work on my OS X setup, where it runs and completes the task as is. My current best guess at a Windows-friendly restructuring is sketched below, but I'm not confident it's right. If someone could offer a few pointers, I would appreciate it. Thanks.
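For what it's worth, this is the layout I'm considering: everything the workers need is defined at module level, the Manager and Pool are only created under the guard, and the shared dict is passed to the workers explicitly instead of living in a module-level global. The names here are placeholders and I haven't been able to test this on the client's Windows machine yet, so I may be off track:

    import datetime
    from multiprocessing import Pool, Manager

    date = str(datetime.date.today())

    def do_scrape(args):
        item, info = args                # each task carries the item and the shared dict
        # start phantomjs, scrape the page for `item`, massage the data...
        info[item] = 'scraped data'

    def scrape_items(s, info):
        pool = Pool(4)
        pool.map(do_scrape, [(item, info) for item in s])    # the dict proxy rides along
        pool.close()                                         # with every task
        pool.join()

    def main():
        manager = Manager()              # only ever created in the parent process
        info = manager.dict([])
        scrape_items(['url1', 'url2'], info)
        # save_scrape(info), update_price(info), etc. would also be module-level functions

    if __name__ == '__main__':
        main()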