
I'm building a couple of scrapers, and now I'm trying to write a script that runs the corresponding spiders with URLs collected from a database, but I can't find a way to do this.

I have this in my spider:

class ElCorteIngles(scrapy.Spider):
    name = 'ElCorteIngles'
    url = ''
    DEBUG = False

    def start_requests(self):
        if self.url != '':
            yield scrapy.Request(url=self.url, callback=self.parse)

    def parse(self, response):
        # Get product name
        try:
            self.p_name = response.xpath('//*[@id="product-info"]/h2[1]/a/text()').get()
        except:
            print(f'{CERROR} Problem while getting product name from website - {self.name}')

        # Get product price
        try:
            self.price_no_cent = response.xpath('//*[@id="price-container"]/div/span[2]/text()').get()
            self.cent = response.xpath('//*[@id="price-container"]/div/span[2]/span[1]/text()').get()
            self.currency = response.xpath('//*[@id="price-container"]/div/span[2]/span[2]/text()').get()
            if self.currency is None:
                self.currency = response.xpath('//*[@id="price-container"]/div/span[2]/span[1]/text()').get()
                self.cent = None
        except:
            print(f'{CERROR} Problem while getting product price from website - {self.name}')

        # Join self.price_no_cent with self.cent
        try:
            if self.cent is not None:
                self.price = str(self.price_no_cent) + str(self.cent)
                self.price = self.price.replace(',', '.')
            else:
                self.price = self.price_no_cent
        except:
            print(f'{CERROR} Problem while joining price with cents - {self.name}')

        # Return data
        if self.DEBUG:
            print([self.p_name, self.price, self.currency])

        data_collected = ShopScrapersItems()
        data_collected['url'] = response.url
        data_collected['p_name'] = self.p_name
        data_collected['price'] = self.price
        data_collected['currency'] = self.currency

        yield data_collected

Normally when I run the spider from the console I do:

scrapy crawl ElCorteIngles -a url='https://www.elcorteingles.pt/electrodomesticos/A26601428-depiladora-braun-senso-smart-5-5500/'

and now I need a way to do the same from an external script and get the output of yield data_collected.

What I currently have in my external script is this:

import scrapy
from scrapy.crawler import CrawlerProcess
import sqlalchemy as db
# Import internal libraries
from Ruby.Ruby.spiders import *

# Variables
engine = db.create_engine('mysql+pymysql://DATABASE_INFO')

class Worker(object):

    def __init__(self):
        self.crawler = CrawlerProcess({})

    def scrape_new_links(self):
        conn = engine.connect()

        # Get all new links from DB and scrape them
        query = 'SELECT * FROM Ruby.New_links'
        result = conn.execute(query)
        for x in result:
            telegram_id = x[1]
            email = x[2]
            phone_number = x[3]
            url = x[4]
            spider = x[5]
            
            # In this case the spider will be ElCorteIngles and the url
            # https://www.elcorteingles.pt/electrodomesticos/A26601428-depiladora-braun-senso-smart-5-5500/

            self.crawler.crawl(spider, url=url)
            self.crawler.start()

Worker().scrape_new_links()

I also don't know if passing url=url to self.crawler.crawl() is the proper way to give the URL to the spider, but let me know what you think. All the data from yield is returned by a pipeline. I think there is no need for extra info, but if you need any, just let me know!

DeadSec
    `except:` see https://stackoverflow.com/questions/54948548/what-is-wrong-with-using-a-bare-except. – AMC Mar 06 '20 at 01:24
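For reference, a minimal, purely illustrative sketch of what that link suggests instead of a bare except: catch a named exception and report it through the spider's built-in logger rather than print (this is not the asker's code):

    try:
        p_name = response.xpath('//*[@id="product-info"]/h2[1]/a/text()').get()
    except Exception as exc:  # still broad, but no longer swallows KeyboardInterrupt/SystemExit
        self.logger.error('Problem while getting product name from %s: %s', response.url, exc)
        p_name = None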

2 Answers


Scrapy works asynchronously. Ignore my imports; this comes from a JSON API I made for Scrapy. You need to make a custom runner with an item_scraped signal. There was originally a klein endpoint, and when the spider finished it would return a JSON list. I think this is what you want, but without the klein endpoint, so I've taken it out. My spider was GshopSpider; I replaced it with your spider's name.

By taking advantage of a deferred we are able to use callbacks and receive a signal each time an item is scraped. Using this code we collect each item into a list via the signal, and when the spider finishes we have a callback set up to return_spider_output.

# server.py
import json

from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

# import your spider class (adjust the path to your project layout)
from Ruby.Ruby.spiders import ElCorteIngles


class MyCrawlerRunner(CrawlerRunner):
    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # keep all items scraped
        self.items = []

        crawler = self.create_crawler(crawler_or_spidercls)

        # append every scraped item to self.items
        crawler.signals.connect(self.item_scraped, signals.item_scraped)

        dfd = self._crawl(crawler, *args, **kwargs)

        # when the crawl finishes, pass the collected items down the callback chain
        dfd.addCallback(self.return_items)
        return dfd

    def item_scraped(self, item, response, spider):
        self.items.append(item)

    def return_items(self, result):
        return self.items


def return_spider_output(output):
    return json.dumps([dict(item) for item in output])


if __name__ == "__main__":
    settings = get_project_settings()
    runner = MyCrawlerRunner(settings)
    # spider arguments (e.g. url=...) can be passed as keyword arguments here
    deferred = runner.crawl(ElCorteIngles)
    deferred.addCallback(return_spider_output)
    deferred.addCallback(print)  # without the klein endpoint, just print the JSON
    deferred.addBoth(lambda _: reactor.stop())
    reactor.run()
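If you want to drive this from the Worker class in your question, a rough sketch (mine, not part of the code above) is to reuse MyCrawlerRunner and chain one crawl per database row, stopping the reactor only when the last crawl is done. The row indexing and the spider-name column follow your question; everything else here is an assumption:

# worker_runner.py - hypothetical sketch, assumes MyCrawlerRunner and
# return_spider_output from server.py above are in scope
from twisted.internet import defer, reactor
from scrapy.utils.project import get_project_settings


@defer.inlineCallbacks
def scrape_new_links(rows):
    runner = MyCrawlerRunner(get_project_settings())
    for row in rows:
        url = row[4]      # same column layout as in the question
        spider = row[5]   # e.g. 'ElCorteIngles'
        items = yield runner.crawl(spider, url=url)  # waits for this crawl to finish
        print(return_spider_output(items))
    reactor.stop()


if __name__ == '__main__':
    rows = []  # fetch these from Ruby.New_links with SQLAlchemy, as in the question
    scrape_new_links(rows)
    reactor.run()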
ThePyGuy

The easiest way to do this would be something like this:

class ElCorteIngles(scrapy.Spider):
    name = 'ElCorteIngles'
    url = ''
    DEBUG = False

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # Establish your DB connection here. This can be any database connection;
        # the cursor usage below assumes a DB-API style connection (e.g. pymysql).
        # Reuse this connection object anywhere else.
        self.conn = engine.connect()

    def start_requests(self):
        with self.conn.cursor() as cursor:
            cursor.execute('''SELECT * FROM Ruby.New_links WHERE url IS NOT NULL AND url != %s''', ('',))
            result = cursor.fetchall()
        for row in result:
            # the URL is in the fifth column, as in your Worker script
            yield scrapy.Request(url=row[4], dont_filter=True, callback=self.parse)

    def parse(self, response):
        # Your parse code here

After doing this, you can initiate this crawler using something like this:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from project_name.spiders.filename import ElCorteIngles


process = CrawlerProcess(get_project_settings())
process.crawl(ElCorteIngles)
process.start()

Hope this helps.

I would also recommend using a queue if you are working with a large number of URLs. This will enable multiple spider processes to work on these URLs in parallel; a rough sketch follows. You can initialize the queue in the __init__ method.
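For example, here is a minimal sketch of that idea (mine, not part of the answer, and the names are illustrative): a shared multiprocessing queue feeds URLs to a couple of worker processes, and each worker registers all of its crawls before calling start(), because the Twisted reactor can only be started once per process. It assumes the spider still accepts a url argument as in the question.

# queue_workers.py - hypothetical sketch
import multiprocessing as mp

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def worker(url_queue):
    # One CrawlerProcess per worker process; register every crawl first,
    # then call start() once (the reactor cannot be restarted).
    process = CrawlerProcess(get_project_settings())
    while True:
        url = url_queue.get()
        if url is None:  # sentinel: no more work for this worker
            break
        process.crawl('ElCorteIngles', url=url)
    process.start()  # blocks until all registered crawls finish


if __name__ == '__main__':
    urls = []  # fill this from Ruby.New_links, as in the question's Worker class

    url_queue = mp.Queue()
    for url in urls:
        url_queue.put(url)

    workers = [mp.Process(target=worker, args=(url_queue,)) for _ in range(2)]
    for _ in workers:
        url_queue.put(None)  # one sentinel per worker
    for w in workers:
        w.start()
    for w in workers:
        w.join()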

Vishnu Kiran