
I am currently working on a web crawler in Python that will take advantage of multiple CPU cores. This is my first endeavor into multiprocessing, and I am encountering a few problems in both implementation and concept. So far, for the sake of development, I am just spawning one process that crawls a website and gets the content I want from it; that part is not a problem.

My problems begin when I want to send the URLs a child parses from a page back to a queue, so that all the child processes can read from that queue to get more work.

With this, I am not sure how to start the program with multiple processes, have them continually get and do work, and then join() them once they are done. Since the number of URLs is obviously large and unknown, I can't set a fixed amount of work to complete before termination.

I welcome any and all critiques and criticism, as I am a budding Python developer. I also know this is a rather involved question, and I greatly appreciate the community's help!

Please do not suggest that I use an existing web crawler; this is a learning experience for me more than anything else, so using existing software to crawl for me is not going to be helpful.

Below is my code so far to give you an idea of what I have and what I understand.

import multiprocessing as mp
import re
import subprocess
import time

from bs4 import BeautifulSoup as bs4
import requests


class CrawlParseProcess(mp.Process):
    def __init__(self, url):
        mp.Process.__init__(self)
        # This implies that a new process per URL which is not smart.
        # Instead should the link queue be passed or something?
        self.url = url

    def run(self):
        proc_name = self.name
        response = requests.get(self.url)
        links = self.link_parse(response.text, self.url)

    def link_parse(self, html, url):
        # Print every absolute link on the page that points back to the site.
        soup = bs4(html, 'html.parser')
        for link in soup.find_all('a'):
            if link.has_attr('href'):
                link = link.get('href')
            if str(link).startswith('http') and 'example.com' in link:
                print(link)
        # Only pages whose URL matches this pattern are worth fully parsing.
        pattern = re.compile(r'http://www\.example\.com/.+\d/*')
        if pattern.search(url):
            return_obj = self.parse(html)
            return_obj["date"] = time.time()
            return return_obj

    def parse(self, html):
        soup = bs4(html, 'html.parser')
        return_obj = {
            "url" : self.url,
            "title" : soup.find('h1', {'itemprop' : 'name'}).text,
            "date-added" : soup.find('div', {'class' : 'addedwhen clrfix'}).text,
            "p" : soup.find('p', {'class' : 'time1'}).text,
            "c" : soup.find('p', {'class' : 'time2'}).text,
            "t" : soup.find('h3', {'class' : 'duration'}).text,
            "items" : soup.find_all('li', {'itemprop' : 'items'}),
         }
        return return_obj


if __name__ == '__main__':
    p = CrawlParseProcess("http://www.example.com")
    p.start()
    p.join()
gaigepr

1 Answer


Basically, you need to start a pool of worker processes, all of them reading from a blocking queue.
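
Since you want to learn the mechanics, here is a minimal sketch of that idea, assuming a hypothetical crawl_one(url) helper that fetches a page and returns the new links it found (your requests/BeautifulSoup code would go there). A JoinableQueue lets the main process block until every queued URL has been handled, which also answers your termination question: you never need to know the total number of URLs in advance. The sketch leaves out things a real crawler needs, such as a set of already-visited URLs.

import multiprocessing as mp


def crawl_one(url):
    # Hypothetical placeholder: fetch and parse `url`, return the new links
    # found on the page. Your CrawlParseProcess logic would live here.
    return []


def worker(queue):
    while True:
        url = queue.get()              # blocks until work or a sentinel arrives
        if url is None:                # sentinel: time to shut down
            queue.task_done()
            break
        try:
            for new_url in crawl_one(url):
                queue.put(new_url)     # feed discovered links back to all workers
        finally:
            queue.task_done()          # mark this URL as handled


if __name__ == '__main__':
    queue = mp.JoinableQueue()
    queue.put('http://www.example.com')   # seed URL (placeholder)

    workers = [mp.Process(target=worker, args=(queue,)) for _ in range(4)]
    for w in workers:
        w.start()

    queue.join()                       # blocks until every queued URL is task_done()
    for _ in workers:
        queue.put(None)                # one sentinel per worker
    for w in workers:
        w.join()

The None sentinels are what let each worker's loop exit cleanly, so the final join() on the processes returns instead of hanging.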


For details or alternatives, there are many other questions on that topic on SO. As an example, see

Sylvain Leroux