I'm looking to scrape all URLs and text content from pages while keeping the crawler on specific domains.
I've seen a method for scraping URLs (retrieve links from web page using python and BeautifulSoup).
I also tried the following code for staying on specific domains, but it doesn't seem to work completely:
from urllib.parse import urlparse

domains = ["newyorktimes.com"]  # etc.

p = urlparse(url)  # url is the link the crawler is considering
print(p, p.hostname)
if p.hostname not in domains:
    return []
# do something with p
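My guess is the check fails because p.hostname includes any subdomain (e.g. www.newyorktimes.com, which isn't equal to newyorktimes.com), and relative links have no hostname at all. Here's a minimal subdomain-aware check I've been experimenting with (the in_domains name is just something I made up):

from urllib.parse import urlparse

domains = ["newyorktimes.com"]  # etc.

def in_domains(url):
    """True if url's hostname equals an entry in domains or is a subdomain of one."""
    hostname = urlparse(url).hostname
    if hostname is None:  # relative links like "/section/world" have no hostname
        return False
    hostname = hostname.lower()
    return any(hostname == d or hostname.endswith("." + d) for d in domains)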
My main problem is making sure the crawler stays on the specified domains, but I'm not sure how to do that when URLs may have different subdomains, paths, or fragments. I know how to scrape the URLs from a given website, and I'm open to using BeautifulSoup, lxml, Scrapy, etc.
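For context, here's roughly the crawl loop I have in mind, using requests and BeautifulSoup, with urljoin to resolve relative paths and urldefrag to drop fragments (just a sketch reusing the in_domains helper above, so it may well be the wrong approach):

from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=100):
    # breadth-first crawl that only queues links passing in_domains()
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        # ... do something with soup.get_text() here ...
        for a in soup.find_all("a", href=True):
            # resolve relative hrefs against the current page, drop #fragments
            link, _ = urldefrag(urljoin(url, a["href"]))
            if link not in seen and in_domains(link):
                seen.add(link)
                queue.append(link)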
This question might be a bit broad, but I've tried searching for crawling within specific domains and couldn't find anything too relevant. :/
Any help/resources will be greatly appreciated!
Thanks