
I am creating a simple Scrapy project to better understand how to use it. What I intend to do is crawl the StackOverflow questions page.

My spider is called `first`, and here is the content of the file:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FirstSpider(CrawlSpider):
    name = 'first'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['https://stackoverflow.com/questions']

    rules = (
            Rule(LinkExtractor(allow=['/questions/\?page=\d&sort=newest']), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = scrapy.Item()

        selector_list = response.css('.question-summary')

        for selector in selector_list:
            item['question'] = selector.css('h3 a::text').extract()
            item['votes'] = selector.css('.vote-count-post strong::text').extract()
            item['answers'] = selector.css('.status strong::text').extract()
            item['views'] = selector.css('.views::text').extract()
            item['username'] = selector.css('.user-details a::text').extract()
            item['user-link'] = selector.css('.user-details a::attr(href)').extract()

        return item

It should then traverse the question pages, gathering the info.

I can get the data from the shell, but it fails when I try to crawl or save the output.
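For reference, this is roughly how I check the selectors in the shell (paraphrased from memory, not my exact session):

scrapy shell "https://stackoverflow.com/questions"
>>> response.css('.question-summary h3 a::text').extract_first()
# returns the first question title as a string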

Here's the output of `scrapy crawl first`:

2018-04-07 13:57:06 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scraper)
2018-04-07 13:57:06 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:19:30) [MSC v.1500 32 bit (Intel)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Windows-XP-5.1.2600-SP3
2018-04-07 13:57:06 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scraper.spiders', 'SPIDER_MODULES': ['scraper.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'scraper'}
2018-04-07 13:57:06 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-04-07 13:57:07 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-04-07 13:57:07 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-04-07 13:57:07 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-04-07 13:57:07 [scrapy.core.engine] INFO: Spider opened
2018-04-07 13:57:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-07 13:57:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-07 13:57:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/robots.txt> (referer: None)
2018-04-07 13:57:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/questions> (referer: None)
2018-04-07 13:57:09 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-07 13:57:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 502,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 35092,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 4, 7, 10, 57, 9, 609000),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 4, 7, 10, 57, 7, 625000)}
2018-04-07 13:57:09 [scrapy.core.engine] INFO: Spider closed (finished)
  • Have you tried `extract_first()`? If you are using PyCharm, why not put a debug point before `return item`? Here is how: https://stackoverflow.com/questions/21788939/how-to-use-pycharm-to-debug-scrapy-projects –  Apr 07 '18 at 12:15
  • Replacing with `extract_first` and adding `extract` to `selector_list` returns the same list. – Sam B. Apr 07 '18 at 12:25
  • Replace the `return item` with `yield item`. ^_^ –  Apr 07 '18 at 12:47
  • The crawl still doesn't happen: `2018-04-07 15:53:43 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)`; it doesn't even crawl the first page. – Sam B. Apr 07 '18 at 12:56
  • Can you post your `CrawlSpider` code? –  Apr 07 '18 at 12:57
  • I think maybe you need to define your item fields in `items.py`; you can refer to this: https://stackoverflow.com/a/48918265/8389458 –  Apr 07 '18 at 12:59

1 Answer


Item fields should be defined in `items.py` as described here (otherwise there will be a `KeyError`): https://doc.scrapy.org/en/latest/topics/items.html#declaring-items
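
For illustration, a minimal `items.py` sketch of such a declaration; the class name `QuestionItem` and the field names below are assumptions taken from the question's selectors, not the asker's actual file:

# items.py -- a hedged sketch; field names mirror the keys used in parse_item
import scrapy


class QuestionItem(scrapy.Item):
    question = scrapy.Field()
    votes = scrapy.Field()
    answers = scrapy.Field()
    views = scrapy.Field()
    username = scrapy.Field()
    # 'user-link' is not a valid Python identifier, so it is declared
    # as user_link here (the spider's key would need to match)
    user_link = scrapy.Field()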

In your case above, the item needs to be created and yielded within the loop (not outside it), like:

for selector in selector_list:
    item = QuestionItem()
    item['question'] = selector.css('h3 a::text').get()
    ...
    yield item

Also, consider using Item Loaders to populate items: https://doc.scrapy.org/en/latest/topics/loaders.html
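
As a rough sketch of what that could look like here, assuming the `QuestionItem` above (the loader class and processor choices are illustrative, not part of the original answer):

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class QuestionLoader(ItemLoader):
    # keep only the first string each CSS query extracts
    default_output_processor = TakeFirst()


def parse_item(self, response):
    for selector in response.css('.question-summary'):
        loader = QuestionLoader(item=QuestionItem(), selector=selector)
        loader.add_css('question', 'h3 a::text')
        loader.add_css('votes', '.vote-count-post strong::text')
        yield loader.load_item()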

  • Used this, but it still doesn't perform a crawl. – Sam B. Apr 09 '18 at 06:26
  • Sorry, it's hard to detect the problem without looking at the code (the provided code example won't work for several reasons). Based on the Scrapy logs, everything is okay with the spider; the problem is only with parsing the response and returning items. – ToryMur Apr 09 '18 at 08:52
  • Also try the `parse` callback, not `parse_item`. – ToryMur Apr 09 '18 at 08:59
  • After a bit of back and forth, I got it working. Now dealing with how to query specific links: https://stackoverflow.com/questions/49728311/scrapy-traverse-other-links – Sam B. Apr 09 '18 at 09:02
  • @SamB. What exactly got it working? I'm facing this exact same problem right now... INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) – rom Mar 07 '21 at 00:35