
I am using the Python multiprocessing module to scrape a website. The website has over 100,000 pages. What I am trying to do is put every 500 pages I retrieve into a separate folder. The problem is that although I successfully create a new folder, my script only populates the previous one. Here is the code:

import os
import time
from multiprocessing import Pool

a = 1
b = 500

def fetchAfter(y):
    global a
    global b

    strfile = "E:\\A\\B\\" + str(a) + "-" + str(b) + "\\" + str(y) + ".html"

    if not os.path.exists(strfile):
        f = open(strfile, "w")

if __name__ == '__main__':
    start = time.time()
    for i in range(1, 3):
        os.makedirs("E:\\Results\\Class 9\\" + str(a) + "-" + str(b))

        pool = Pool(processes=12)
        pool.map(fetchAfter, range(a, b))
        pool.close()
        pool.join()
        a = b
        b = b + 500

    print time.time() - start
user1343318
  • Why multiprocessing? Why not multithreading? – Joel Cornett Aug 23 '12 at 19:12
  • *off topic:* There is really no need to use the `global` keyword. As far as I can tell, removing it from your script won't change a thing. – mgilson Aug 23 '12 at 19:13
  • Joel -- Because of some GIL thing. I have multiple cores. – user1343318 Aug 23 '12 at 20:24
  • mgilson -- you are right. But can you help me with my third last comment below? – user1343318 Aug 23 '12 at 20:25
  • @user1343318: I'm assuming that your program is IO-bound, since you mention in the title that you're using it to scrape websites. In that case, the GIL should not significantly affect execution time. The reason I mention threading versus multiprocessing is because of the overhead of creating new processes. Of course, if you want to take advantage of multiple cores, then multiprocessing is the way to go, but make sure that using multiple cores *will* increase performance before going that route. – Joel Cornett Aug 23 '12 at 20:34
  • @user1343318: The owner of the website might not appreciate it if you start scraping his site at full speed... – Roland Smith Aug 23 '12 at 20:37
  • @RolandSmith: That's why I suggested using threads. Decreased memory overhead + IO bound execution. – Joel Cornett Aug 23 '12 at 20:38
  • @JoelCornett: The process creation overhead of `multiprocessing.Pool` is not very high if you're scraping 100000 pages; by default it only creates the processes once at the beginning of the run, and executes the worker function repeatedly in each of the children. – Roland Smith Aug 23 '12 at 20:42
  • @RolandSmith: You're right, I just wanted to point out that *assuming* that multiprocessing is always better than multithreading is not a good idea. – Joel Cornett Aug 23 '12 at 20:45
  • @JoelCornett: It seems to me that the _usual advice_ that goes around if people ask how to speed up their program is to use threads. Depending on what one is trying to accomplish, it could very well be that using an event-driven architecture (`select()` loop or `gevent` greenlets) is far superior to either threads or multiprocessing. Not to mention using another Python interpreter. – Roland Smith Aug 23 '12 at 21:17
  • have you considered wget, scrapy? – jfs Aug 23 '12 at 22:35

3 Answers


It is best for the worker function to rely only on the single argument it gets to determine what to do, because that is the only information it receives from the parent process each time it is called. This argument can be almost any Python object (including a tuple, dict, or list), so you're not really limited in the amount of information you can pass to a worker.

So make a list of 2-tuples. Each 2-tuple should consist of (1) the file to get and (2) the directory where to stash it. Feed that list of tuples to map(), and let it rip.

I'm not sure if it is useful to specify the number of processes you want to use. Pool generally uses as many processes as your CPU has cores. That is usually enough to max out all the cores. :-)

BTW, you should only call map() once. And since map() blocks until everything is done, there is no need to call join().

Edit: Added example code below.

import multiprocessing
import requests
import os

def processfile(arg):
    """Worker function to scrape the pages and write them to a file.

    Keyword arguments:
    arg -- 2-tuple containing the URL of the page and the directory
           where to save it.
    """
    # Unpack the arguments
    url, savedir = arg

    # It might be a good idea to put a random delay of a few seconds here, 
    # so we don't hammer the webserver!
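    # One way to do that (just a suggestion, not part of the original answer):
    #     import random, time   # at the top of the file
    #     time.sleep(random.uniform(1, 5))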

    # Scrape the page. Requests rules ;-)
    r = requests.get(url)
    # Write it, keep the original HTML file name.
    fname = url.split('/')[-1]
    with open(os.path.join(savedir, fname), 'w+') as outfile:
        outfile.write(r.text)

def main():
    """Main program.
    """
    # This list of tuples should hold all the pages... 
    # Up to you how to generate it, this is just an example.
    worklist = [('http://www.foo.org/page1.html', 'dir1'), 
                ('http://www.foo.org/page2.html', 'dir1'), 
                ('http://www.foo.org/page3.html', 'dir2'), 
                ('http://www.foo.org/page4.html', 'dir2')]
    # Create output directories
    dirlist = ['dir1', 'dir2']
    for d in dirlist:
        os.makedirs(d)
    p = multiprocessing.Pool()
    # Let'er rip!
    p.map(processfile, worklist)
    p.close()

if __name__ == '__main__':
    main()
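
For the question's layout of 500 pages per folder under E:\A\B, the worklist could be built along these lines. This is only a sketch: build_worklist is a made-up helper, and the URL pattern is a placeholder since the real site isn't shown.

def build_worklist(base_url, total_pages=100000, per_dir=500):
    """Group page URLs into folders of 500 pages, e.g. E:\\A\\B\\1-500."""
    worklist = []
    for start in range(1, total_pages + 1, per_dir):
        end = start + per_dir - 1
        savedir = 'E:\\A\\B\\%d-%d' % (start, end)
        for y in range(start, end + 1):
            # Placeholder URL scheme: one numbered .html page per URL.
            worklist.append(('%s/%d.html' % (base_url, y), savedir))
    return worklist

The directories to create are then just the distinct savedir values, e.g. sorted(set(d for _, d in worklist)).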
Roland Smith

Multiprocessing, as the name implies, uses separate processes. The processes you create with your Pool do not have access to the original values of a and b that you are adding 500 to in the main program. See this previous question.

The easiest solution is to just refactor your code so that you pass a and b to fetchAfter (in addition to passing y).
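
For example, here is a minimal sketch of that refactor, assuming a single E:\A\B path scheme for both the folders and the files (the original question creates the folders under a different path). Each work item is a tuple, so the worker gets its own copy of a and b instead of relying on globals:

import os
from multiprocessing import Pool

def fetchAfter(args):
    # Unpack everything the worker needs from a single tuple.
    y, a, b = args
    strfile = "E:\\A\\B\\" + str(a) + "-" + str(b) + "\\" + str(y) + ".html"
    if not os.path.exists(strfile):
        with open(strfile, "w") as f:
            pass  # fetch the page and write it here

if __name__ == '__main__':
    a, b = 1, 500
    for i in range(1, 3):
        os.makedirs("E:\\A\\B\\" + str(a) + "-" + str(b))
        pool = Pool(processes=12)
        # Each work item carries its own copy of a and b.
        pool.map(fetchAfter, [(y, a, b) for y in range(a, b)])
        pool.close()
        pool.join()
        a, b = b, b + 500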

BrenBarn
  • Okay, I tried to pass pool.map(fetchAfter, range(a,b), a, b) but now it says map takes at most 4 arguments, 5 given. P.S.: That is the reason I resorted to globals in the first place. – user1343318 Aug 23 '12 at 20:17
  • The simplest way is to do as @Roland Smith suggests, and pass a single tuple containing the range, a, and b. – BrenBarn Aug 24 '12 at 02:03

Here's one way to implement it:

#!/usr/bin/env python
import logging
import multiprocessing as mp
import os
import urllib

def download_page(url_path):
    try:
        urllib.urlretrieve(*url_path)
        mp.get_logger().info('done %s' % (url_path,))
    except Exception as e:
        mp.get_logger().error('failed %s: %s' % (url_path, e))

def generate_url_path(rootdir, urls_per_dir=500):
    for i in xrange(100*1000):
        if i % urls_per_dir == 0: # make new dir
            dirpath = os.path.join(rootdir, '%d-%d' % (i, i+urls_per_dir))
            if not os.path.isdir(dirpath):
                os.makedirs(dirpath) # stop if it fails
        url = 'http://example.com/page?' + urllib.urlencode(dict(number=i))
        path = os.path.join(dirpath, '%d.html' % (i,))
        yield url, path

def main():
    mp.log_to_stderr().setLevel(logging.INFO)

    pool = mp.Pool(4) # the number of processes is unrelated to the number of
                      # CPUs because the task is IO-bound
    for _ in pool.imap_unordered(download_page, generate_url_path(r'E:\A\B')):
        pass

if __name__ == '__main__':
    main()

See also *Python multiprocessing pool.map for multiple arguments* and the code *Brute force basic http authorization using httplib and multiprocessing* from *how to make HTTP in Python faster?*

jfs