
For a search bot, I am working on a design to:
* compare URIs and
* determine which URIs are really the same page

Dealing with redirects and aliases:
Case 1: Redirects
Case 2: Aliases e.g. www
Case 3: URL parameters e.g. sukshma.net/node#parameter

I have two approaches I could follow: one is to explicitly check for redirects, which catches case #1; the other is to "hard code" aliases such as www, which covers case #2. The second approach (hard-coding aliases) is brittle, and the HTTP specification (RFC 2616) does not mention the use of www as an alias.
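To make that concrete, here is a rough sketch of both approaches in Python (standard library only; the function names are mine and purely illustrative). `resolve_redirects` covers case #1 by letting the client follow 3xx responses and reporting the final URL, while `normalize` covers cases #2 and #3 with the same hard-coded `www` stripping described above, so it shares that brittleness:

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen

def resolve_redirects(url):
    """Case 1: follow HTTP redirects and return the URL that was finally served."""
    with urlopen(url) as resp:      # urlopen follows 3xx responses by itself
        return resp.geturl()

def normalize(url):
    """Cases 2 and 3: lowercase scheme/host, drop the hard-coded 'www.' alias, drop the fragment."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    host = netloc.lower()
    if host.startswith("www."):     # the brittle hard-coded alias mentioned above
        host = host[4:]
    return urlunsplit((scheme.lower(), host, path or "/", query, ""))
```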

I also intend to use the canonical meta tag (HTTP/HTML), but if I understand it correctly, I cannot rely on the tag being present in all cases.
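Along the same lines, a minimal sketch of reading the canonical link from the HTML (assuming the page body has already been fetched); when the tag is absent it returns `None` and the URL-based rules above have to decide:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of <link rel="canonical"> if the page declares one."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            attrs = dict(attrs)
            if (attrs.get("rel") or "").lower() == "canonical":
                self.canonical = attrs.get("href")

def find_canonical(html):
    parser = CanonicalFinder()
    parser.feed(html)
    return parser.canonical     # None when no canonical tag is present
```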

Do share your own experience. Do you know of a reference white paper or implementation for detecting duplicates in search bots?

– Santosh

2 Answers


Building your own web crawler is a lot of work. Consider checking out some of the open source spiders already available, like JSpider, OpenWebSpider or many others.

– Chris Fulstow
  • I get where you are going, however, I need the technology and know-how for duplicate detection (and not just for crawls). Would you know if these projects have resolved that successfully? – Santosh Dec 11 '09 at 06:02
  • Despite my own advice, I've built my own crawler and stored a checksum for every crawled page. If a page was a potential duplicate of another, based on its URL or other criteria, then I compared the checksums to check. – Chris Fulstow Dec 11 '09 at 06:22
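A minimal sketch of the checksum bookkeeping described in the comment above (the names and the choice of SHA-1 are illustrative, not taken from any particular crawler):

```python
import hashlib

seen_checksums = {}    # digest -> first URL crawled with that content

def is_duplicate(url, body):
    """Flag `url` as a duplicate if an identical body has already been crawled."""
    digest = hashlib.sha1(body).hexdigest()
    if digest in seen_checksums:
        return True
    seen_checksums[digest] = url
    return False
```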

The first case would be solved by simply checking the HTTP status code.
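For example, a minimal sketch with Python's standard library; urllib follows redirects on its own, so the small handler below disables that and surfaces the 3xx status instead:

```python
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from silently following 3xx responses."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

_opener = urllib.request.build_opener(NoRedirect)

def redirect_target(url):
    """Return the Location header if the server answers with a 3xx, else None."""
    try:
        _opener.open(url)
    except urllib.error.HTTPError as e:
        if 300 <= e.code < 400:
            return e.headers.get("Location")
        raise
    return None
```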

For the 2nd and 3rd cases, Wikipedia explains it very well: URL Normalization / Canonicalization.

– Alix Axel