
For a search bot, I am working on a design to:
* compare URIs and
* determine which URIs are really the same page

Dealing with redirects and aliases:
Case 1: Redirects
Case 2: Aliases e.g. www
Case 3: URL parameters e.g. sukshma.net/node#parameter

I have two approaches I could follow: one is to explicitly check for redirects, which catches case #1; the other is to "hard code" aliases such as www, which covers case #2. The second approach (hard-coding aliases) is brittle, and the HTTP specification (RFC 2616) does not mention the use of www as an alias.
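To make that concrete, here is a rough sketch of both approaches in Python (standard library only; the function names are mine and purely illustrative). `resolve_redirects` covers case #1 by letting the client follow 3xx responses and reporting the final URL, while `normalize` covers cases #2 and #3 with the same hard-coded `www` stripping described above, so it shares that brittleness:

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen

def resolve_redirects(url):
    """Case 1: follow HTTP redirects and return the URL that was finally served."""
    with urlopen(url) as resp:      # urlopen follows 3xx responses by itself
        return resp.geturl()

def normalize(url):
    """Cases 2 and 3: lowercase scheme/host, drop the hard-coded 'www.' alias, drop the fragment."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    host = netloc.lower()
    if host.startswith("www."):     # the brittle hard-coded alias mentioned above
        host = host[4:]
    return urlunsplit((scheme.lower(), host, path or "/", query, ""))
```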

I also intend to use the canonical meta tag (HTTP/HTML), but if I understand it correctly, I cannot rely on the tag being present in all cases.
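Along the same lines, a minimal sketch of reading the canonical link from the HTML (assuming the page body has already been fetched); when the tag is absent it returns `None` and the URL-based rules above have to decide:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of <link rel="canonical"> if the page declares one."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            attrs = dict(attrs)
            if (attrs.get("rel") or "").lower() == "canonical":
                self.canonical = attrs.get("href")

def find_canonical(html):
    parser = CanonicalFinder()
    parser.feed(html)
    return parser.canonical     # None when no canonical tag is present
```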

Do share your own experience. Do you know of a reference white paper or implementation for detecting duplicates in search bots?

– Santosh

2 Answers


Building your own web crawler is a lot of work. Consider checking out some of the open source spiders already available, like JSpider, OpenWebSpider or many others.

– Chris Fulstow
  • I get where you are going, however, I need the technology and know-how for duplicate detection (and not just for crawls). Would you know if these projects have resolved that successfully? – Santosh Dec 11 '09 at 06:02
  • Despite my own advice, I've built my own crawler and stored a checksum for every crawled page. If a page was a potential duplicate of another, based on its URL or other criteria, then I compared the checksums to check. – Chris Fulstow Dec 11 '09 at 06:22
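A minimal sketch of the checksum bookkeeping described in the comment above (the names and the choice of SHA-1 are illustrative, not taken from any particular crawler):

```python
import hashlib

seen_checksums = {}    # digest -> first URL crawled with that content

def is_duplicate(url, body):
    """Flag `url` as a duplicate if an identical body has already been crawled."""
    digest = hashlib.sha1(body).hexdigest()
    if digest in seen_checksums:
        return True
    seen_checksums[digest] = url
    return False
```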

The first case would be solved by simply checking the HTTP status code.
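For example, a minimal sketch with Python's standard library; urllib follows redirects on its own, so the small handler below disables that and surfaces the 3xx status instead:

```python
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from silently following 3xx responses."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

_opener = urllib.request.build_opener(NoRedirect)

def redirect_target(url):
    """Return the Location header if the server answers with a 3xx, else None."""
    try:
        _opener.open(url)
    except urllib.error.HTTPError as e:
        if 300 <= e.code < 400:
            return e.headers.get("Location")
        raise
    return None
```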

For the 2nd and 3rd cases, Wikipedia explains it very well: URL Normalization / Canonicalization.

– Alix Axel