
I'm looking to scrape all URLs/text content and crawl within specific domains.

I've seen a method of scraping URLs (retrieve links from web page using python and BeautifulSoup).

I also tried the following code for staying on specific domains, but it doesn't seem to work completely.

from urllib.parse import urlparse

domains = ["newyorktimes.com", etc]
p = urlparse(url)
print(p, p.hostname)
if p.hostname in domains:
    pass
else:
    return []

# do something with p

My main problem is making sure the crawler stays on the specified domain, but I'm not sure how to do that when URLs may have different paths/fragments. I know how to scrape the URLs from a given website. I'm open to using BeautifulSoup, lxml, Scrapy, etc.
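To make the problem concrete, here is roughly the check I've been experimenting with, using only the standard library (treating subdomains like www. as part of the same domain is my own guess at the right behaviour, and the domain list is just an example):

from urllib.parse import urlparse

domains = ["newyorktimes.com", "nytimes.com"]

def is_allowed(url):
    # Only the hostname matters for the domain check; paths and fragments are ignored
    host = urlparse(url).hostname or ""
    # Accept the bare domain and any subdomain of it (e.g. www.nytimes.com)
    return any(host == d or host.endswith("." + d) for d in domains)

print(is_allowed("https://www.nytimes.com/section/world#top"))  # True
print(is_allowed("https://example.com/path?q=nytimes.com"))     # False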

This question might be a bit too broad, but I've tried searching for information about crawling within specific domains and couldn't find much that was relevant. :/

Any help/resources will be greatly appreciated!

Thanks

Anon Li

1 Answer


Try this. The allowed_domains list is what keeps the crawler on the specified domains, and the links returned from extract are handed back to the framework to be crawled next.

from simplified_scrapy.spider import Spider, SimplifiedDoc
class MySpider(Spider):
  name = 'newyorktimes.com'
  allowed_domains = ['newyorktimes.com','nytimes.com']
  # concurrencyPer1s=1
  start_urls = 'https://www.newyorktimes.com'
  refresh_urls = True # For debug. If refresh_urls = True, start_urls will be crawled again.

  def extract(self, url, html, models, modelNames):
    doc = SimplifiedDoc(html)
    lstA = doc.listA(url=url['url']) # Get all the links on the page
    return {"Urls": lstA, "Data": None} # Return data to framework

from simplified_scrapy.simplified_main import SimplifiedMain
SimplifiedMain.startThread(MySpider()) # Start crawling

Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/spider_examples
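If you would rather use Scrapy itself (which the question lists as an option), the same idea is built in: allowed_domains plus a CrawlSpider rule makes the framework drop any link that leaves those domains. A minimal sketch, where the spider name, the callback, and the extracted fields are placeholders of my own:

from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class NytSpider(CrawlSpider):
    name = 'nytimes'
    allowed_domains = ['nytimes.com']        # offsite links are filtered out automatically
    start_urls = ['https://www.nytimes.com']

    # Follow every on-domain link and call parse_page for each fetched page
    rules = (Rule(LinkExtractor(), callback='parse_page', follow=True),)

    def parse_page(self, response):
        yield {
            'url': response.url,
            'text': ' '.join(response.css('p::text').getall()),
        }

process = CrawlerProcess()
process.crawl(NytSpider)
process.start()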

dabingsou