I know there is a command in JavaScript: var x = document.domain;
that gets the domain, but how can I implement this in Scrapy so I can obtain domain names?

Prometheus
3 Answers
You can extract the domain name from response.url:
from urlparse import urlparse

def parse(self, response):
    parsed_uri = urlparse(response.url)
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    print domain
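To illustrate what that produces, here is the same logic in an interactive Python 2 session (the URL is a made-up example):

>>> from urlparse import urlparse
>>> parsed_uri = urlparse('http://www.example.com/some/page')
>>> '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
'http://www.example.com/'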
For Python 3, two very minor changes are needed, to the 'from' import and the 'print' call. alecxe's answer is good for Python 2.
Also, for Scrapy's CrawlSpider, rename 'parse' above to something else, because CrawlSpider uses 'parse' for itself.
from urllib.parse import urlparse

def get_domain(self, response):
    parsed_uri = urlparse(response.url)
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    print(domain)
    return domain
Then you can use it, as in the OP's example (note it is a method, so call it on self with the response):
x = self.get_domain(response)
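For context, here is a minimal sketch of that call inside a plain scrapy.Spider (the spider name and start URL are placeholders, not from the original answer):

import scrapy
from urllib.parse import urlparse

class DomainSpider(scrapy.Spider):
    # hypothetical spider name, for illustration only
    name = 'domain_spider'
    start_urls = ['http://www.example.com/']  # placeholder URL

    def get_domain(self, response):
        parsed_uri = urlparse(response.url)
        return '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)

    def parse(self, response):
        # a plain Spider can keep the default 'parse' callback;
        # the rename advice above applies to CrawlSpider
        x = self.get_domain(response)
        self.logger.info('Domain: %s', x)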
Or, in my case, I wanted to pass the domain to Scrapy's CrawlSpider's Rule's LinkExtractor's allow_domains. Phew. This limits the crawl to that domain. Note that allow_domains expects a bare domain such as 'www.example.com', not the full 'scheme://netloc/' string, so pass parsed_uri.netloc rather than the formatted URI here.
rules = [
    Rule(
        LinkExtractor(
            canonicalize=True,
            unique=True,
            strip=True,
            allow_domains=(domain,)  # domain must be the bare netloc, e.g. 'www.example.com'
        ),
        follow=True,
        callback="someparser"
    )
]
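Putting it together, a sketch of a complete CrawlSpider under the assumptions above (the spider name, start URL, and the someparser body are placeholders for illustration):

from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

START_URL = 'http://www.example.com/'  # placeholder start URL
# allow_domains wants a bare domain, so take just the netloc
DOMAIN = urlparse(START_URL).netloc

class SiteSpider(CrawlSpider):
    # hypothetical spider name, for illustration only
    name = 'site_spider'
    start_urls = [START_URL]

    rules = [
        Rule(
            LinkExtractor(
                canonicalize=True,
                unique=True,
                strip=True,
                allow_domains=(DOMAIN,)
            ),
            follow=True,
            callback='someparser'
        )
    ]

    def someparser(self, response):
        # placeholder callback: just log each page the crawl reaches
        self.logger.info('Crawled %s', response.url)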

Saj