
I have this code available from my previous experiment.

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://example.com/']

    def parse(self, response):
        for title in response.css('h2'):
            yield {'Agent-name': title.css('a ::text').extract_first()}

        next_page = response.css('li.col-md-3 ln-t > div.cs-team team-grid > figure > a ::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

I don't understand how to modify this code so that it takes a list of URLs from a text file as input (maybe 200+ domains), checks the HTTP status of each domain, and stores the result in a file. I am trying this to check whether the domains are live or not.
The output I am expecting is:

example.com,200
example1.com,300
example2.com,503

I want to give a file as input to the Scrapy script, and it should give me the above output. I have looked at the questions How to detect HTTP response status code and set a proxy accordingly in scrapy? and Scrapy and response status code: how to check against it?
But I had no luck there. Hence, I am thinking of modifying my code to get this done. How can I do that? Please help me.

Jaffer Wilson
  • What do you mean by _"check the HTTP status of the domains"_? Your file has URLs, do you mean _check the HTTP status of each URL_? By default, scrapy will only feed your callbacks with HTTP 200 responses. You can look at [`handle_httpstatus_all`](https://docs.scrapy.org/en/latest/topics/spider-middleware.html#std:reqmeta-handle_httpstatus_all) meta key to get non-200 responses too. – paul trmbrth Feb 01 '17 at 11:57
  • @paultrmbrth I just want to store the status code of each URL, together with the URL, in another file. If possible I will then keep the live ones and wipe out the rest. This is what I am trying to do. Can you help me? I will surely read the documentation again; I have read it before but it was not helpful to me. – Jaffer Wilson Feb 01 '17 at 12:00
  • Are you able to collect (url, status) items? something like `yield {"url": response.url, "status": response.status}` in your callback should already give you all HTTP-200 responses. – paul trmbrth Feb 01 '17 at 13:35
  • @paultrmbrth I am trying to input the file using this solution: http://stackoverflow.com/questions/8376630/scrapy-read-list-of-urls-from-file-to-scrape I don't know why it is not working. How should I input a file with multiple domains to Scrapy? – Jaffer Wilson Feb 01 '17 at 14:12
  • I suggest that you open a new question, paste your code as-is, share logs (preferably with `LOG_LEVEL='DEBUG'`) and the content of the input file. "it is not working" is too vague for the community to help you. – paul trmbrth Feb 01 '17 at 14:23

1 Answer


For each response object you can get the URL and status code through the response object's properties. So for each link you send a request to, you can read the status code from response.status. Does that work the way you want?

def parse(self, response):
    # file chosen to get output (appending mode):
    file.write(u"%s : %s\n" % (response.url, response.status))
    # if response.status in [400, ...]: do something

    for title in response.css('h2'):
        yield {'Agent-name': title.css('a ::text').extract_first()}

    next_page = response.css('li.col-md-3 ln-t > div.cs-team team-grid > figure > a ::attr(href)').extract_first()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
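To cover the rest of the question (read the domains from a text file and record every status, not just 200), here is a minimal sketch. It is only illustrative: the file names urls.txt and status.csv and the spider name are assumptions, and it relies on the handle_httpstatus_all meta key mentioned in the comments so that non-200 responses also reach the callback.

import scrapy

class StatusSpider(scrapy.Spider):
    name = 'statusspider'  # hypothetical spider name

    def start_requests(self):
        # 'urls.txt' is an assumed input file: one domain or URL per line.
        with open('urls.txt') as f:
            for line in f:
                url = line.strip()
                if not url:
                    continue
                if not url.startswith(('http://', 'https://')):
                    url = 'http://' + url
                # handle_httpstatus_all lets non-200 responses reach parse()
                # instead of being filtered out by HttpErrorMiddleware.
                yield scrapy.Request(url, callback=self.parse,
                                     meta={'handle_httpstatus_all': True},
                                     errback=self.on_error)

    def parse(self, response):
        # Append "url,status" to an assumed output file.
        with open('status.csv', 'a') as out:
            out.write(u"%s,%s\n" % (response.url, response.status))

    def on_error(self, failure):
        # DNS errors, timeouts and refused connections never produce a
        # response, so log them here instead of silently losing dead domains.
        self.logger.error(repr(failure))

Alternatively, as suggested in the comments, yield {'url': response.url, 'status': response.status} from parse() and run the spider with -o status.csv so Scrapy's feed exporter writes the file for you.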

Pablo