
Sometimes when I scrape a site, it does not return URLs with the hostname (e.g. /search/en or search/en). How do I get the hostname in Scrapy so I can add it before making a request? Currently, I am hardcoding it.

def parse_table(self, response):
    for links in self._parse_xpath(response, 'table'):
        for link in links:
            # Annoying part: the hostname is hardcoded instead of resolved
            # dynamically, and other callbacks have to do the same because
            # of the incomplete URLs.
            yield Request(url='https://domain.io' + link,
                        callback=self.parse_document_tab)
  • Can you provide an MWE? – robertspierre Sep 10 '18 at 12:44
  • What are you using to scrape the site, e.g. the Scrapy code? What have you tried so far, and are you getting any errors, or is it just not working correctly in that instance? – Kyhle Ohlinger Sep 10 '18 at 12:44
  • @raffamaiden What's an MWE? –  Sep 10 '18 at 12:45
  • @KyhleOhlinger Scrapy is working fine, but it is annoying to have to deal with different URLs before parsing them into requests. Here's a snippet in the post. –  Sep 10 '18 at 12:48
  • So if I'm correct, you want to get the full URL? Have you looked at https://stackoverflow.com/questions/6499603/python-scrapy-convert-relative-paths-to-absolute-paths ? – Kyhle Ohlinger Sep 10 '18 at 12:50
  • The solution I want is a URL resolver that joins the collected URL to the domain if it has a / in front, and to the referer if it does not (as sketched after these comments). Problem is, I can't find any methods to get the hostname or the referer. –  Sep 10 '18 at 12:51
  • @KyhleOhlinger That is very useful. I actually had not checked that. –  Sep 10 '18 at 12:55
  • @KyhleOhlinger Love it, that is just what I need! Thank you! –  Sep 10 '18 at 13:00
  • @KuoChongYii glad I could help :) – Kyhle Ohlinger Sep 10 '18 at 13:08
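
For reference, the resolution rule described in that comment is exactly what the standard library's `urllib.parse.urljoin` implements (and what Scrapy's `response.urljoin` effectively wraps); a minimal sketch with a made-up base URL:

from urllib.parse import urljoin, urlparse

# Hypothetical page the links were scraped from.
base = 'https://domain.io/reports/2018/index.html'

urlparse(base).netloc        # -> 'domain.io' (the hostname itself)

# A leading slash resolves against the domain root...
urljoin(base, '/search/en')  # -> 'https://domain.io/search/en'

# ...while no leading slash resolves relative to the current page's path.
urljoin(base, 'search/en')   # -> 'https://domain.io/reports/2018/search/en'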

1 Answer


You can either use the `response.urljoin` method to join your relative URL to the base URL:

def parse_table(self, response):
    for links in self._parse_xpath(response, 'table'):
        for link in links:
            yield Request(url=response.urljoin(link),
                          callback=self.parse_document_tab)
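
For instance, if this page had been fetched from a hypothetical URL like https://domain.io/table/page, the two URL shapes from the question would resolve as:

response.urljoin('/search/en')  # -> 'https://domain.io/search/en'
response.urljoin('search/en')   # -> 'https://domain.io/table/search/en'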

Or the newer `response.follow` method (Scrapy 1.4.0+), which builds the proper absolute URL and returns a `Request` object in one step:

def parse_table(self, response):
    for links in self._parse_xpath(response, 'table'):
        for link in links:
            yield response.follow(link, callback=self.parse_document_tab)
– Valdir Stumm Junior
  • I will use the second method just because it's newer. Thank you for the answer! –  Sep 10 '18 at 14:31
  • Another advantage of the second method is that it works even if `link` is not a string: if it's a `Selector` object wrapping an `a` element, `.follow` will extract the `href` from that element for you. – Valdir Stumm Junior Sep 10 '18 at 17:48
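
A minimal sketch of what that last comment describes, assuming the table cells contain `<a>` elements (the XPath here is illustrative, not from the question):

def parse_table(self, response):
    # Passing the Selector for each <a> element directly; response.follow
    # extracts the href attribute itself (Scrapy 1.4.0+).
    for anchor in response.xpath('//table//a'):
        yield response.follow(anchor, callback=self.parse_document_tab)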