I am reading *Web Scraping with Python*, 2nd Edition, and wanted to use the Scrapy module to crawl information from a webpage.
I found the following in the documentation: https://docs.scrapy.org/en/latest/topics/request-response.html
callback (callable) – the function that will be called with the response of this request (once it’s downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
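My guess at the idea behind that paragraph is something like the simplified sketch below. To be clear, `MiniRequest`, `MiniResponse`, and `download_and_call` are names I made up to illustrate my understanding; they are not Scrapy's real internals.

```python
# Made-up sketch of the docs paragraph above, NOT real Scrapy code:
# the engine downloads the page, then calls the stored callback
# with the response as its argument.
class MiniResponse:
    def __init__(self, url):
        self.url = url

class MiniRequest:
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

def download_and_call(request, spider):
    response = MiniResponse(request.url)         # pretend we downloaded it
    callback = request.callback or spider.parse  # default to parse()
    return callback(response)                    # response passed in here

class MiniSpider:
    def parse(self, response):
        return 'parsed ' + response.url

spider = MiniSpider()
req = MiniRequest('http://example.com', callback=spider.parse)
print(download_and_call(req, spider))  # prints: parsed http://example.com
```

Is this roughly what happens, i.e. the framework itself supplies the response argument when the download finishes?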
My understanding is that:

we pass in a url and get a resp back, like we do with the requests module:
resp = requests.get(url)
then we pass resp in for data parsing:
parse(resp)
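To make that mental model concrete, here is a tiny no-network version of it. `FakeResponse` and `fetch` are stand-ins I made up; a real run would use `resp = requests.get(url)` instead.

```python
# No-network sketch of the resp -> parse flow described above.
# FakeResponse and fetch() are made-up stand-ins for requests.get().
class FakeResponse:
    def __init__(self, url, text):
        self.url = url
        self.text = text

def fetch(url):
    # pretend this performed requests.get(url)
    return FakeResponse(url, '<h1>Python</h1>')

def parse(resp):
    # parsing step: here resp is passed in explicitly by me
    print('URL is: {}'.format(resp.url))

resp = fetch('http://example.com')
parse(resp)  # prints: URL is: http://example.com
```

In this version I can see exactly where resp is handed to parse, which is what I cannot see in the Scrapy version.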
My questions are:
- I don't see where resp is passed in.
- Why do we need to put the self keyword before parse when passing the callback (callback=self.parse)?
- self is never used inside the parse function, so why bother making it the first parameter?
- Can we extract the url from the response parameter like this: url = response.url, or should it be url = self.url?
import scrapy

class ArticleSpider(scrapy.Spider):
    name = 'article'

    def start_requests(self):
        urls = [
            'http://en.wikipedia.org/wiki/Python_'
            '%28programming_language%29',
            'https://en.wikipedia.org/wiki/Functional_programming',
            'https://en.wikipedia.org/wiki/Monty_Python']
        return [scrapy.Request(url=url, callback=self.parse) for url in urls]

    def parse(self, response):
        url = response.url
        title = response.css('h1::text').extract_first()
        print('URL is: {}'.format(url))
        print('Title is: {}'.format(title))
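For context on the self questions, here is my current understanding of Python's method protocol, using a Toy class I made up for illustration: every method defined in a class receives the instance as its first argument, conventionally named self, even if the body never uses it.

```python
# Made-up Toy class illustrating why parse must declare self:
# Python passes the instance as the first argument automatically.
class Toy:
    def parse(self, response):
        # self is required by the method protocol even though
        # this body never touches it
        return response.upper()

t = Toy()
bound = t.parse            # bound method: self is pre-filled with t
print(bound('hello'))      # prints: HELLO
print(Toy.parse(t, 'hi'))  # same call with self passed explicitly; prints: HI
```

If this is right, then self.parse in the spider is a bound method, which would explain why Scrapy can later call it with only the response. Please correct me if I have this wrong.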