
I know there is a statement in JavaScript, var x = document.domain;, that gets the domain, but how can I implement this in Scrapy so I can obtain domain names?

Prometheus

3 Answers


You can extract the domain name from the response.url:

from urlparse import urlparse  # Python 2; in Python 3 this module is urllib.parse

def parse(self, response):
    # break the URL into its components (scheme, netloc, path, ...)
    parsed_uri = urlparse(response.url)
    # keep only the scheme and network location, e.g. 'http://example.com/'
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    print domain
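For example (a hypothetical URL, purely to illustrate): for a response fetched from http://www.example.com/some/page, parsed_uri.scheme is 'http', parsed_uri.netloc is 'www.example.com', and this prints:

http://www.example.com/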
alecxe

For Python 3, two very minor changes are needed, to the 'from' import and the 'print' call. alecxe's answer is good for Python 2.

Also, for Scrapy's CrawlSpider, rename 'parse' above to something else, because CrawlSpider uses 'parse' for itself.

from urllib.parse import urlparse

def get_domain(self, response):
    # same logic as above, with the Python 3 import and print()
    parsed_uri = urlparse(response.url)
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    print(domain)
    return domain

Then you can use it, as in the OP's example:

x = self.get_domain(response)

Or, in my case, I wanted to pass the domain to the allow_domains argument of the LinkExtractor in a CrawlSpider Rule. Phew. This limits the crawl to that domain. Note that allow_domains expects bare domain names like 'example.com', so pass the netloc rather than the full 'scheme://netloc/' string.

rules = [
    Rule(
        LinkExtractor(
            canonicalize=True,
            unique=True,
            strip=True,
            allow_domains=(domain,)  # trailing comma: (domain) alone is not a tuple
        ),
        follow=True,
        callback="someparser"
    )
]
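Because rules is evaluated when the class is defined, one common pattern is to build it per start URL in __init__, before calling the parent constructor. A minimal sketch, assuming Scrapy's CrawlSpider (the spider name, the start_url argument, and the parse_page callback are hypothetical names):

from urllib.parse import urlparse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SingleDomainSpider(CrawlSpider):
    name = "single_domain"

    def __init__(self, start_url="http://www.example.com/", *args, **kwargs):
        # allow_domains wants a bare host like 'example.com',
        # so use the netloc component, not the full URL
        domain = urlparse(start_url).netloc
        self.start_urls = [start_url]
        self.rules = [
            Rule(
                LinkExtractor(canonicalize=True, unique=True, strip=True,
                              allow_domains=(domain,)),
                follow=True,
                callback="parse_page",
            ),
        ]
        # CrawlSpider.__init__ compiles self.rules, so set them first
        super().__init__(*args, **kwargs)

    def parse_page(self, response):
        yield {"url": response.url}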
Saj

Try:

url = response.url.split("/")[2]
print(url)
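For instance (a hypothetical URL), 'https://www.example.com/questions/1234'.split('/') gives ['https:', '', 'www.example.com', 'questions', '1234'], so index 2 is 'www.example.com'. Unlike the urlparse answers above, this returns just the host, without the scheme.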
Janib Soomro