How to use Scrapy to scrape an SO question and store it in MongoDB?

Question

This tutorial on web scraping with Scrapy (link) is a little outdated. It seems that the Selector XPath has changed.

When I copy the XPath, I have:

def parse(self, response):
    questions = Selector(response).xpath('//*[@id="question-header"]/h1/a')

But the code from the link above is:

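from scrapy import Spider
from scrapy.selector import Selector

# StackItem is the project's Item subclass (defined in its items.py); import it from there.

class StackSpider(Spider):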
    name = "stack"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')

        for question in questions:
            item = StackItem()
            item['title'] = question.xpath(
                'a[@class="question-hyperlink"]/text()').extract()[0]
            item['url'] = question.xpath(
                'a[@class="question-hyperlink"]/@href').extract()[0]
            yield item

How do I construct the generator with the new Selector?

The SO question we are scraping as an example is:

Spring data @transactional not rolling back with SQL Server and after runtimeexception

Matthew Daniels' suggestions:

In [4]: response
Out[4]: <200 https://stackoverflow.com/questions/27624141/spring-data-transactional-not-rolling-back-with-sql-server-and-after-runtimeexc>

In [5]: response.css(".question-hyperlink").xpath("@href").extract_first()
Out[5]: '/questions/27624141/spring-data-transactional-not-rolling-back-with-sql-server-and-after-runtimeexc'

In [6]: response.css(".summary h3")
Out[6]: []

In [7]: response.css("#question-header > h1 > a")
Out[7]: [<Selector xpath="descendant-or-self::*[@id = 'question-header']/h1/a" data='<a href="/questions/27624141/spring-data'>]

What, exactly, is the error you are encountering? Also, you don't need to construct `Selector` manually, since `response.xpath()` behaves rationally. And given how many `@class` selectors you are using, you'll find `response.css("#question-header > h1 > a")` and `response.css(".summary h3")` much more legible. You can still chain them to use those `xpath` functions: `response.css(".question-hyperlink").xpath("@href").extract_first()` - mdaniel, Nov 13 '18 at 03:47
"summary h3 is empty. Why?" Because you have mixed up the single-question page used in `Out[4]` with the `/questions?pagesize=50` view, which contains only the _summaries_ of the questions. - mdaniel, Nov 13 '18 at 16:42
