
Sometimes when I scrape a site, it does not return URLs with the hostname (e.g. /search/en or search/en). How do I get the hostname in Scrapy so I can add it before making a request? Currently, I am hardcoding it.

def parse_table(self, response):
    for links in self._parse_xpath(response, 'table'):
        for link in links:
            # Annoying part: the hostname is hardcoded instead of resolved
            # dynamically, and other callbacks have to do the same because
            # of the incomplete URLs.
            yield Request(url='https://domain.io' + link,
                        callback=self.parse_document_tab)
  • Can you provide an MWE? – robertspierre Sep 10 '18 at 12:44
  • What are you using to scrape the site, e.g. the Scrapy code? What have you tried so far, and are you getting any errors, or is it just not working correctly in that instance? – Kyhle Ohlinger Sep 10 '18 at 12:44
  • @raffamaiden What's an MWE? –  Sep 10 '18 at 12:45
  • @KyhleOhlinger Scrapy is working fine, but it is annoying to have to deal with different URLs before parsing them into requests. Here's a snippet in the post. –  Sep 10 '18 at 12:48
  • So if I'm correct, you want to get the full URL? Have you looked at https://stackoverflow.com/questions/6499603/python-scrapy-convert-relative-paths-to-absolute-paths ? – Kyhle Ohlinger Sep 10 '18 at 12:50
  • The solution I want is a URL resolver that joins the collected URL to the domain if it has a / in front, and to the referer if it does not (as sketched after these comments). Problem is, I can't find any methods to get the hostname or the referer. –  Sep 10 '18 at 12:51
  • @KyhleOhlinger That is very useful. I actually had not checked that. –  Sep 10 '18 at 12:55
  • @KyhleOhlinger Love it, that is just what I need! Thank you! –  Sep 10 '18 at 13:00
  • @KuoChongYii glad I could help :) – Kyhle Ohlinger Sep 10 '18 at 13:08
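
For reference, the resolution rule described in that comment is exactly what the standard library's `urllib.parse.urljoin` implements (and what Scrapy's `response.urljoin` effectively wraps); a minimal sketch with a made-up base URL:

from urllib.parse import urljoin, urlparse

# Hypothetical page the links were scraped from.
base = 'https://domain.io/reports/2018/index.html'

urlparse(base).netloc        # -> 'domain.io' (the hostname itself)

# A leading slash resolves against the domain root...
urljoin(base, '/search/en')  # -> 'https://domain.io/search/en'

# ...while no leading slash resolves relative to the current page's path.
urljoin(base, 'search/en')   # -> 'https://domain.io/reports/2018/search/en'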

1 Answer


You can either use the `response.urljoin` method to join your relative URL to the base URL:

def parse_table(self, response):
    for links in self._parse_xpath(response, 'table'):
        for link in links:
            yield Request(url=response.urljoin(link),
                          callback=self.parse_document_tab)
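
For instance, if this page had been fetched from a hypothetical URL like https://domain.io/table/page, the two URL shapes from the question would resolve as:

response.urljoin('/search/en')  # -> 'https://domain.io/search/en'
response.urljoin('search/en')   # -> 'https://domain.io/table/search/en'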

Or the newer `response.follow` method (Scrapy 1.4.0+), which builds the proper absolute URL and returns a `Request` object in one step:

def parse_table(self, response):
    for links in self._parse_xpath(response, 'table'):
        for link in links:
            yield response.follow(link, callback=self.parse_document_tab)
– Valdir Stumm Junior
  • I will use the second method just because it's newer. Thank you for the answer! –  Sep 10 '18 at 14:31
  • Another advantage of the second method is that it works even if `link` is not a string: if it's a `Selector` object wrapping an `a` element, `.follow` will extract the `href` from that element for you. – Valdir Stumm Junior Sep 10 '18 at 17:48
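
A minimal sketch of what that last comment describes, assuming the table cells contain `<a>` elements (the XPath here is illustrative, not from the question):

def parse_table(self, response):
    # Passing the Selector for each <a> element directly; response.follow
    # extracts the href attribute itself (Scrapy 1.4.0+).
    for anchor in response.xpath('//table//a'):
        yield response.follow(anchor, callback=self.parse_document_tab)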