I know there is a command in JavaScript: var x = document.domain;
that gets the domain, but how can I implement this in Scrapy so I can obtain domain names?

Prometheus
3 Answers
You can extract the domain name from response.url:
from urlparse import urlparse

def parse(self, response):
    parsed_uri = urlparse(response.url)
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    print domain
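To illustrate what that produces, here is the same logic in an interactive Python 2 session (the URL is a made-up example):

>>> from urlparse import urlparse
>>> parsed_uri = urlparse('http://www.example.com/some/page')
>>> '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
'http://www.example.com/'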
For Python 3, two very minor changes are needed, to the 'from' import and the 'print' call. alecxe's answer is good for Python 2.
Also, for Scrapy's CrawlSpider, rename 'parse' above to something else, because CrawlSpider uses 'parse' for itself.
from urllib.parse import urlparse

def get_domain(self, response):
    parsed_uri = urlparse(response.url)
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    print(domain)
    return domain
Then you can use it, as in the OP's example (note it is a method, so call it on self with the response):
x = self.get_domain(response)
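For context, here is a minimal sketch of that call inside a plain scrapy.Spider (the spider name and start URL are placeholders, not from the original answer):

import scrapy
from urllib.parse import urlparse

class DomainSpider(scrapy.Spider):
    # hypothetical spider name, for illustration only
    name = 'domain_spider'
    start_urls = ['http://www.example.com/']  # placeholder URL

    def get_domain(self, response):
        parsed_uri = urlparse(response.url)
        return '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)

    def parse(self, response):
        # a plain Spider can keep the default 'parse' callback;
        # the rename advice above applies to CrawlSpider
        x = self.get_domain(response)
        self.logger.info('Domain: %s', x)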
Or, in my case, I wanted to pass the domain to Scrapy's CrawlSpider's Rule's LinkExtractor's allow_domains. Phew. This limits the crawl to that domain. Note that allow_domains expects a bare domain such as 'www.example.com', not the full 'scheme://netloc/' string, so pass parsed_uri.netloc rather than the formatted URI here.
rules = [
    Rule(
        LinkExtractor(
            canonicalize=True,
            unique=True,
            strip=True,
            allow_domains=(domain,)  # domain must be the bare netloc, e.g. 'www.example.com'
        ),
        follow=True,
        callback="someparser"
    )
]
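Putting it together, a sketch of a complete CrawlSpider under the assumptions above (the spider name, start URL, and the someparser body are placeholders for illustration):

from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

START_URL = 'http://www.example.com/'  # placeholder start URL
# allow_domains wants a bare domain, so take just the netloc
DOMAIN = urlparse(START_URL).netloc

class SiteSpider(CrawlSpider):
    # hypothetical spider name, for illustration only
    name = 'site_spider'
    start_urls = [START_URL]

    rules = [
        Rule(
            LinkExtractor(
                canonicalize=True,
                unique=True,
                strip=True,
                allow_domains=(DOMAIN,)
            ),
            follow=True,
            callback='someparser'
        )
    ]

    def someparser(self, response):
        # placeholder callback: just log each page the crawl reaches
        self.logger.info('Crawled %s', response.url)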

Saj