I use Scrapy to crawl and scrape StackOverflow.com. This is so.py:
import scrapy


class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com']

    def parse(self, response):
        # Follow the link of every question listed on the front page.
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        # Emit one item per question page, holding only its URL.
        yield {
            'link': response.url,
        }
Expected result in so.json (valid JSON, a flat array of URLs):
[
"http://stackoverflow.com/questions/36421917/exponential-number-in-custom-number-format-of-excel",
"http://stackoverflow.com/questions/36421343/can-not-install-requirements-txt",
"http://stackoverflow.com/questions/36418815/difference-between-two-approaches-to-pass-parameters-to-web-server",
"http://stackoverflow.com/questions/36421743/sharing-an-oracle-database-connection-between-simultaneous-celery-tasks",
"http://stackoverflow.com/questions/36421941/jquery-add-css-style"
]
I run the spider with:
scrapy runspider so.py -o so.json
The result doesn't match the expected output above, and I'm stuck here.
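As far as I can tell, Scrapy's JSON feed export writes each yielded item as an object, so so.json holds entries shaped like {"link": "..."} rather than bare strings. One workaround I can think of, as a minimal sketch under that assumption, is to post-process the feed into a flat array. The helper name flatten_feed is hypothetical and not part of Scrapy, and the sample text below only imitates the feed using two of the URLs above.

```python
import json


def flatten_feed(feed_text):
    """Convert '[{"link": url}, ...]' into a JSON array of bare URL strings.

    Assumes every item in the feed has a "link" key, as in the spider above.
    """
    items = json.loads(feed_text)
    links = [item["link"] for item in items]
    return json.dumps(links, indent=0)


# Sample input imitating the assumed shape of so.json (not real crawl output).
sample_feed = json.dumps([
    {"link": "http://stackoverflow.com/questions/36421343/can-not-install-requirements-txt"},
    {"link": "http://stackoverflow.com/questions/36421941/jquery-add-css-style"},
])

print(flatten_feed(sample_feed))
```

To use it on the real file, the contents of so.json would be read into a string, passed to flatten_feed, and the returned text written back out. I would still prefer a way to make Scrapy emit this format directly.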