
I am trying to build a crawler using Scrapy. In every tutorial, whether in Scrapy's official documentation or in blog posts, I see people defining a class in a .py file and executing it through the scrapy shell.

On their main page, the following example is given

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}

        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)

and then the code is run using

scrapy runspider myspider.py

I am unable to find a way to write the same code so that it can be executed with something like

python myspider.py

I also looked at the Requests and Responses section of their website to understand how requests and responses are handled within the shell, but running that code from the regular Python interpreter

( >> python myspider.py )

did not show anything. Any guidance on how to transform the code so that it runs outside the scrapy shell, or pointers to any documents that elaborate on this, would be appreciated.

EDIT: Downvoters, please do not take undue advantage of your anonymity. If you have a valid reason to downvote, please make your point in a comment after you downvote.

harshvardhan
  • I don't know about running the `spider.py` script directly as-is, but this might help? https://stackoverflow.com/a/31374345/3381305 – Matt Hall Mar 09 '18 at 12:56

1 Answer


You can use a CrawlerProcess to run your spider from a plain Python script, and then run it with python myspider.py

For example:

import scrapy
from scrapy.crawler import CrawlerProcess


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}

        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)


if __name__ == '__main__':
    process = CrawlerProcess({
        # a custom user agent; replace with whatever suits your target site
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(BlogSpider)
    process.start()  # the script blocks here until the crawl is finished

Useful link: https://doc.scrapy.org/en/latest/topics/practices.html
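That practices page also documents a CrawlerRunner variant for cases where you want to manage the Twisted reactor yourself. A minimal sketch adapted from that page, reusing the BlogSpider from the question (trimmed to the title-scraping part):

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}


configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(BlogSpider)
# stop the reactor once the crawl finishes, whether it succeeded or failed
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script blocks here until the crawl is finished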

Ami Hollander
  • Thanks, this works. I would also like to know how to return some value. The return statement inside `parse` does not work. Where shall I place the return statement when I have to return some value? – harshvardhan Mar 14 '18 at 13:13
  • For example, inside the parse function, I want to assign some attribute from the response to a variable that can be used globally. – harshvardhan Mar 14 '18 at 13:25
  • First, you cannot return a value from `parse` because you are not the one calling the function, so you cannot get its return value. You can declare a static (class-level) variable and set it inside `parse`; a static variable can be used globally. (A signal-based alternative is sketched after these comments.) – Ami Hollander Mar 15 '18 at 11:42
  • @harshvardhan I'm trying to do exactly the same thing and I haven't found how to do it yet. Scrapy looks nice and has a lot of built-in stuff if you want to write a scraper, but I simply need to parse a page for some values and RETURN those values, so my script can decide what to do with them. I don't want to run scrapy as a command... I need to run my own script – Andrea Grandi Aug 14 '18 at 22:22
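Following up on this thread: one way to get scraped values back into the calling script, without returning from `parse`, is to connect a handler to Scrapy's item_scraped signal and collect items into a plain Python list. This is a sketch, not part of the original answer; the signal machinery (create_crawler, crawler.signals.connect, signals.item_scraped) is standard Scrapy, while the `results` name and `collect_item` helper are just illustrative:

import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerProcess


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}


results = []  # scraped items are appended here and remain usable after the crawl


def collect_item(item, response, spider):
    results.append(item)


if __name__ == '__main__':
    process = CrawlerProcess()
    crawler = process.create_crawler(BlogSpider)
    # item_scraped fires once for every item the spider yields
    crawler.signals.connect(collect_item, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks until crawling is finished

    print(results)  # e.g. [{'title': '...'}, ...]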