I am reading Web Scraping with Python, 2nd Ed., and want to use the Scrapy module to crawl information from web pages.

I found the following in the documentation: https://docs.scrapy.org/en/latest/topics/request-response.html

callback (callable) – the function that will be called with the response of this request (once it’s downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.

My understanding is that:

  1. Pass in a URL and get a response back, as we do with the requests module:

    resp = requests.get(url)

  2. Pass that response in for data parsing:

    parse(resp)

My questions are:

  1. I don't see where resp is passed in.
  2. Why do we need to put self before parse in the callback argument (callback=self.parse)?
  3. self is never used inside the parse function, so why bother making it the first parameter?
  4. Can we extract the URL from the response parameter like this: url = response.url, or should it be url = self.url?

import scrapy


class ArticleSpider(scrapy.Spider):
    name = 'article'

    def start_requests(self):
        urls = [
            # Adjacent string literals are joined into a single URL
            'http://en.wikipedia.org/wiki/Python_'
            '%28programming_language%29',
            'https://en.wikipedia.org/wiki/Functional_programming',
            'https://en.wikipedia.org/wiki/Monty_Python',
        ]

        return [scrapy.Request(url=url, callback=self.parse) for url in urls]

    def parse(self, response):
        url = response.url
        title = response.css('h1::text').extract_first()
        print('URL is: {}'.format(url))
        print('Title is: {}'.format(title))
x86_64
  • scrapy uses async and is built to be used as a generator (use `yield` always), the convention is to pass `self, response` in any of its functions that handle `response` – wishmaster Jul 04 '20 at 21:17

2 Answers

You can find information about self in the Python tutorial on classes: https://docs.python.org/3/tutorial/classes.html


Regarding this question:

can we extract the URL from the response parameter like this: url = response.url, or should it be url = self.url?

You should use response.url to get the URL of the page you are currently crawling/parsing.

Roman
  • Hi Roman, thanks for the feedback, you got my point. What I am asking is why we can get the URL from response.url, as I didn't see anywhere that this parameter is explicitly passed in. – x86_64 Feb 26 '21 at 14:48

It seems you are missing a few concepts related to Python classes and OOP. It would be a good idea to read the Python docs, or at the very least this question.

Here is how Scrapy works: you instantiate a Request object and yield it to the Scrapy scheduler.

yield scrapy.Request(url=url)  # or use return, like you did

Scrapy handles the request, downloads the HTML, and passes everything it got back for that request, wrapped in a Response object, to a callback function. If you didn't set a callback in your request (as in my example above), it calls the default one, the spider's parse method.
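To make the "where is the response passed in" step concrete, here is a toy model of what a crawling framework does with your request. It is not Scrapy's real code: ToyRequest, ToyResponse, ToySpider, toy_engine and fake_download are made-up names, and nothing is actually downloaded. It only sketches the calling pattern: you hand over a callable, and the framework later calls it with the response as its argument.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToyResponse:
    """Hypothetical stand-in for a downloaded page (not Scrapy's Response)."""
    url: str
    body: str


@dataclass
class ToyRequest:
    """Hypothetical stand-in for scrapy.Request: a URL plus a callback."""
    url: str
    callback: Optional[Callable] = None


def fake_download(url: str) -> ToyResponse:
    # Assumption: pretend the page was fetched; no network access happens.
    return ToyResponse(url=url, body="<h1>Page for {}</h1>".format(url))


class ToySpider:
    def __init__(self):
        self.seen = []

    def start_requests(self):
        # self.parse has no parentheses: we hand the method over,
        # we do not call it ourselves.
        return [ToyRequest(url="https://example.com/a", callback=self.parse)]

    def parse(self, response):
        # The framework supplies `response`; the URL travels with it.
        self.seen.append(response.url)


def toy_engine(spider):
    for request in spider.start_requests():
        response = fake_download(request.url)
        # Fall back to spider.parse when the request has no callback.
        callback = request.callback or spider.parse
        callback(response)  # <-- this is where the response is passed in


spider = ToySpider()
toy_engine(spider)
print(spider.seen)
```

In this sketch the spider never calls parse itself; the engine does, which is why the call is not visible in the spider's own code.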

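parse is a method (a function defined on a class) of your spider object. You wrote it in your code above, and even if you hadn't, it would still be there, since your class inherits all methods from its parent class (the inherited version only raises NotImplementedError, so you still need to write your own):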

class ArticleSpider(scrapy.Spider): # <<<<<<<< here
    name='article'

So a TL; DR of your questions:

1-You didn't see it because it happens inside Scrapy's own code (the parent class and the engine), not in yours.

2-You need to write self.parse so Python knows you are referencing a method of the spider instance.

3-The self parameter is the instance itself. Python passes it automatically whenever a method is called on the instance, whether or not the method body uses it.

4-The response is an independent object that your parse method receives as an argument, so you can access its attributes, such as response.url or response.headers.
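Points 2 and 3 are plain Python behaviour rather than anything specific to Scrapy, so a small standalone sketch may help. Greeter is a made-up class used only for illustration.

```python
class Greeter:
    def parse(self, response):
        # `self` is unused in the body, but Python still passes it.
        return "got {}".format(response)


g = Greeter()

# g.parse is a bound method: it remembers which instance it belongs to.
bound = g.parse
assert bound.__self__ is g

# These two calls are equivalent; the first fills in `self` automatically.
a = g.parse("page")
b = Greeter.parse(g, "page")
assert a == b == "got page"
```

Passing callback=self.parse therefore hands Scrapy a bound method, and calling it with one argument (the response) is enough because the instance is already attached.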

renatodvc
  • actually you can skip the convention `parse` and use only `staticmethod` (or even normal functions) to handle the response – wishmaster Jul 04 '20 at 21:13
  • Hi renatodvc, what I am confused about is why we can get the URL from response.url, as I didn't see anywhere that this parameter is explicitly passed in. – x86_64 Feb 26 '21 at 14:48