
How would you go about crawling a website so that you can index every page when the only real navigation is a search bar, as on the following sites?

https://plejehjemsoversigten.dk/

https://findadentist.ada.org/

Do people just brute-force the search queries, or is there a method that's usually used to index these kinds of websites?

SketchyManDan

1 Answer


There are several ways to approach this (though if the owner of a site does not want it crawled, any of them can be quite challenging):

  • Check the site's robots.txt. It may give you a clue about the site structure and may declare a sitemap.
  • Check the site's sitemap.xml. It may list the URLs the site owner wants to be public (see the first sketch after this list).
  • Use an existing index such as Google, with advanced syntax that narrows the search to a particular site (e.g. site:your.domain).
  • Exploit weaknesses in the site's design. For example, the first site in your list does not enforce a minimum query length, so you can search for, say, a, get the 800 results containing a, and then repeat for the remaining letters.
  • From each search-result page, also crawl all links on the result item pages, since related pages are often listed there (see the second sketch after this list).
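
A minimal sketch of the robots.txt / sitemap.xml checks, using only the Python standard library. The base URL is the first site from your question; nothing else here assumes anything about that site, and either file may simply not exist:

```python
# Check robots.txt (structure hints, declared sitemaps) and sitemap.xml (public URLs).
import urllib.robotparser
import urllib.request
import xml.etree.ElementTree as ET

base = "https://plejehjemsoversigten.dk"

# robots.txt: disallowed paths hint at site structure; it may also declare sitemaps.
rp = urllib.robotparser.RobotFileParser(base + "/robots.txt")
rp.read()
print("Sitemaps declared in robots.txt:", rp.site_maps())

# sitemap.xml: if present, it enumerates the URLs the owner wants indexed.
try:
    with urllib.request.urlopen(base + "/sitemap.xml") as resp:
        tree = ET.parse(resp)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    urls = [loc.text for loc in tree.findall(".//sm:loc", ns)]
    print(f"{len(urls)} URLs listed in sitemap.xml")
except Exception as exc:
    print("No usable sitemap.xml:", exc)
```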
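
And a minimal sketch of the brute-force idea from the last two bullets: issue one query per letter and collect every link from the result pages. The search endpoint (/search) and the parameter name (q) below are hypothetical placeholders; inspect the site's actual search requests (browser dev tools, network tab) to find the real ones, and throttle your requests so you don't hammer the server:

```python
# Query each letter of the alphabet and collect all links from the result pages.
import string
import urllib.parse
import urllib.request
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values from all <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(href)

base = "https://plejehjemsoversigten.dk"
found = set()
for letter in string.ascii_lowercase:
    # Hypothetical endpoint and parameter; adjust to the site's real search API.
    url = base + "/search?" + urllib.parse.urlencode({"q": letter})
    try:
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except Exception:
        continue
    parser = LinkCollector()
    parser.feed(html)
    # Resolve relative links against the page URL before queuing them for crawling.
    found.update(urllib.parse.urljoin(url, href) for href in parser.links)

print(f"Collected {len(found)} candidate URLs to crawl")
```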
Alexey R.