
I am trying to build a crawler (using Scrapy) that launches spiders from a main.py with multiprocessing.

The first spider (cat_1) is launched without multiprocessing, using scrapy.crawler.CrawlerProcess:

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from myproject.spiders.cat_1 import cat_1
from myproject import settings as default_settings

crawler_settings = Settings()
crawler_settings.setmodule(default_settings)
runner = CrawlerProcess(settings=crawler_settings)
runner.crawl(cat_1)
runner.start(stop_after_crawl=True)

It works fine, and all the data is handled by the FEED export.
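
The FEED configuration itself is not shown here; a setting along these lines in the project settings (an assumption about default_settings) would produce the cat_1.json file that gets loaded back further down, one JSON object per line:

# Assumed feed configuration in myproject/settings.py (not shown in the question).
# "jsonlines" writes one JSON object per line, matching how the results are
# parsed later (json.loads() on each line of cat_1.json).
FEEDS = {
    "cat_1.json": {
        "format": "jsonlines",
        "encoding": "utf-8",
    },
}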

The next spider needs the first spider's results and is run with multiprocessing:

After loading the results from the first spider, I create a list of URLs and send it to my function process_cat_2(). This function creates processes, and each of them launches the spider cat_2:

from multiprocessing import Process

def launch_crawler_cat_2(crawler, url):
    cat_name = url[0]
    cat_url = url[1]
    
    runner.crawl(crawler, cat_name, cat_url)


def process_cat_2(url_list):
    nb_spiders = len(url_list)
    list_process = [None] * nb_spiders
    
    while(url_list):
        for i in range(nb_spiders):
            if not (list_process[i] and list_process[i].is_alive()):
                list_process[i] = Process(target=launch_crawler_cat_2, args=(cat_2, url_list.pop(0)))
                list_process[i].start()
                # break

    # Wait for all processes to end
    for process in list_process:
        if process:
            # process.start()
            process.join()

The problem is that runner.crawl(crawler, cat_name, cat_url) (for cat_2) does not crawl anything:

2021-10-07 17:20:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

And I do not know how to reuse the existing twisted.internet.reactor so as to avoid this error:

twisted.internet.error.ReactorNotRestartable

which I get when using:

def launch_crawler_cat_2(crawler, url):
    cat_name = url[0]
    cat_url = url[1]
    
    runner.crawl(crawler, cat_name, cat_url)
    runner.start()

How can I launch a new spider with the existing reactor object?


2 Answers


Here's a solution for anyone stuck at the same point I was. I was able to run multiple spiders where some spiders need the results of previous ones, and some run with multiprocessing.

Initialize each crawler in a different process:

import sys
import json
import pandas as pd
from multiprocessing import Process

## Scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

## Spiders & settings
from myproject.spiders.cat_1 import cat_1
from myproject.spiders.cat_2 import cat_2
from myproject import settings as default_settings

## Crawler settings (each process builds its own CrawlerProcess from these)
crawler_settings = Settings()
crawler_settings.setmodule(default_settings)

def launch_crawler_cat_2(crawler, url):
    # Each child process builds its own CrawlerProcess (and therefore its own reactor)
    process = CrawlerProcess(crawler_settings)
    process.crawl(crawler, url[0], url[1])
    process.start(stop_after_crawl=True)

def process_cat_2(url_list):
    nb_spiders = 5  # cap on the number of concurrent crawler processes
    list_process = [None] * nb_spiders
    
    while(url_list):
        for i in range(nb_spiders):
            if not (list_process[i] and list_process[i].is_alive()):
                list_process[i] = Process(target=launch_crawler_cat_2, args=(cat_2, url_list.pop(0)))
                list_process[i].start()
                break

    # Wait for all processes to end
    for process in list_process:
        if process:
            process.join()

def crawl_cat_1():
    # Runs in its own child process, so this CrawlerProcess gets a fresh reactor
    process = CrawlerProcess(crawler_settings)
    process.crawl(cat_1)
    process.start(stop_after_crawl=True)

if __name__=="__main__":

    ## Scrape cat_1
    process_cat_1 = Process(target=crawl_cat_1)
    process_cat_1.start()
    process_cat_1.join()

    ##########################################################################
    ########## LOAD cat_1 RESULTS
    try:
        # cat_1's FEED writes one JSON object per line (JSON Lines)
        with open('./cat_1.json', 'r', encoding="utf-8") as f:
            lines = [json.loads(line) for line in f]
            df_cat_1 = pd.DataFrame(lines)
    except (FileNotFoundError, json.JSONDecodeError):
        df_cat_1 = pd.DataFrame([])

    print(df_cat_1)
    if df_cat_1.empty:
        sys.exit('df_cat_1 empty DataFrame')

    df_cat_1['cat_1_tuple'] = list(zip(df_cat_1.cat_name, df_cat_1.cat_url))
    df_cat_1_tuple_list = df_cat_1.cat_1_tuple.tolist()

    process_cat_2(df_cat_1_tuple_list)
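
The key point is that every CrawlerProcess is created and started inside its own child Process, so each crawl gets a fresh Twisted reactor; the parent process never starts a reactor at all, which is why ReactorNotRestartable does not come up.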

Well... I found a solution to run multiple spiders, multiple times, by using CrawlerRunner as recommended by the docs: https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script

There’s another Scrapy utility that provides more control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper that encapsulates some simple helpers to run multiple crawlers, but it won’t start or interfere with existing reactors in any way.

Using this class the reactor should be explicitly run after scheduling your spiders. It’s recommended you use CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run Scrapy in the same reactor.

Here is my solution: https://stackoverflow.com/a/71643677/18604520
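
For reference, and only as a rough sketch rather than the linked answer's exact code, the CrawlerRunner pattern from that section of the docs applied to the cat_1 / cat_2 setup above might look like this; load_cat_1_results() is a helper written for the sketch, modeled on the JSON-loading step in the other answer:

import json

from twisted.internet import reactor, defer

from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings
from scrapy.utils.log import configure_logging

from myproject.spiders.cat_1 import cat_1
from myproject.spiders.cat_2 import cat_2
from myproject import settings as default_settings

configure_logging()
crawler_settings = Settings()
crawler_settings.setmodule(default_settings)

# CrawlerRunner schedules crawls but never starts or stops the reactor itself
runner = CrawlerRunner(settings=crawler_settings)

def load_cat_1_results():
    # Helper added for this sketch, mirroring the JSON Lines loading step from
    # the other answer: one (cat_name, cat_url) tuple per line of cat_1.json
    with open('./cat_1.json', encoding="utf-8") as f:
        return [(item["cat_name"], item["cat_url"]) for item in map(json.loads, f)]

@defer.inlineCallbacks
def crawl():
    # cat_1 runs first and writes its FEED file; once its Deferred fires,
    # the results are loaded and cat_2 is scheduled once per category,
    # all on the same reactor, which is started and stopped exactly once.
    yield runner.crawl(cat_1)
    for cat_name, cat_url in load_cat_1_results():
        yield runner.crawl(cat_2, cat_name, cat_url)
    reactor.stop()

crawl()
reactor.run()  # blocks here until crawl() calls reactor.stop()

With the yields, the cat_2 crawls run one after another in a single process; scheduling them without yield and then waiting on runner.join() would let them run concurrently on the same reactor, as in the docs' example of running multiple spiders in the same process.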