I have implemented three levels of recursion to generate a seed list of URLs and then scrape the info from each URL. I want to use multiprocessing to take advantage of all the cores on my system and speed up crawling. Here is the crawler code I have implemented so far.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request

from CompanyInfoGrabber.Utility.utils import getAddress, getCompanyStatus, getDirectorDetail, getRegNumber


class CompanyInfoGrabberSpider(scrapy.Spider):
    name = 'CompanyDetail'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    def parse(self, response):
        print("User Agent in parse() is: ", response.request.headers['User-Agent'])
        hxp = HtmlXPathSelector(response)
        # First level: collect the per-sitemap URLs from the sitemap index.
        URL_LIST = hxp.xpath('//sitemapindex/sitemap/loc/text()').extract()
        print("URL LIST: ", URL_LIST)
        for URL in URL_LIST[:2]:
            next_page = response.urljoin(URL)
            yield Request(next_page, self.parse_page)

    def parse_page(self, response):
        print("User Agent in parse_page() is: ", response.request.headers['User-Agent'])
        hxp = HtmlXPathSelector(response)
        # Second level: create the seed list of company URLs.
        COMPANY_URL_LIST = hxp.xpath('//urlset/url/loc/text()').extract()
        print("Company url: ", COMPANY_URL_LIST[:20])
        """
        Here I want to use multiprocessing like this:
        pool = Pool(processes=8)
        pool.map(parse_company_detail, COMPANY_URL_LIST)
        """
        for company_url in COMPANY_URL_LIST[:5]:
            next_page = response.urljoin(company_url)
            yield Request(next_page, self.parse_company_detail)

    def parse_company_detail(self, response):
        # Third level: scrape the details of a single company page.
        COMPANY_DATA = dict()
        print("User Agent in parse_company_detail() is: ", response.request.headers['User-Agent'])
        hxp = HtmlXPathSelector(response)
        _ABOUT_ = ''.join(hxp.xpath('normalize-space(//div[@class="panel-body"]/text())').extract())
        for node in hxp.xpath('//div[@class="panel-body"]//p'):
            _ABOUT_ += ''.join(node.xpath('string()').extract())
        COMPANY_DATA['About'] = _ABOUT_
        # Get the rest of the company data.
        COMPANY_DATA = getDirectorDetail(COMPANY_DATA, hxp)
        print("Dictionary: ", COMPANY_DATA)
        return COMPANY_DATA
How can I use multiprocessing to crawl this seed list of URLs? Thanks in advance.
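Concretely, here is a rough sketch of what I have in mind. It is not working code: fetch_company_detail is a hypothetical standalone rework of parse_company_detail that takes a bare URL and fetches it with requests, because a multiprocessing.Pool worker has no Scrapy response to work with and, as far as I understand, cannot call a spider callback directly:

    from multiprocessing import Pool

    import requests
    from scrapy.selector import Selector


    def fetch_company_detail(url):
        # Hypothetical standalone version of parse_company_detail:
        # fetch the page ourselves, since a Pool worker gets a URL, not a response.
        response = requests.get(url)
        hxp = Selector(text=response.text)
        about = ''.join(hxp.xpath('normalize-space(//div[@class="panel-body"]/text())').extract())
        for node in hxp.xpath('//div[@class="panel-body"]//p'):
            about += ''.join(node.xpath('string()').extract())
        # getDirectorDetail(...) would be called here, as in the spider.
        return {'About': about}


    if __name__ == '__main__':
        company_url_list = []  # the seed list built in parse_page
        with Pool(processes=8) as pool:
            results = pool.map(fetch_company_detail, company_url_list)

I am not sure whether something like this can coexist with Scrapy's Twisted reactor, or whether I should instead rely on Scrapy's own concurrency (e.g. the CONCURRENT_REQUESTS setting) to use my cores.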
Update: My question is not a duplicate of this one. Here I have only one spider.
Regards,
Om Prakash