
I have a problem with my web robot (crawler) application.

URL A: http://www.domain.com/path?id=1

URL B: http://www.domain.com/path?id=1&sessionid=XXXXXX

These two URLs forward to the same page, so the robot application downloads the page twice.

In my robot application, each URL is converted to an MD5 value to check whether it has already been visited. But when the URL string changes, the MD5 value changes too, so the visited cache never hits.
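For reference, here is roughly what I do now, plus the obvious first fix (strip session-like parameters and normalize the query string before hashing). This is only a Python sketch; the parameter names in `TRACKING_PARAMS` are just examples from my case:

```python
import hashlib
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Parameters assumed not to change the page content (extend as needed).
TRACKING_PARAMS = {"sessionid", "checked"}

def canonical_key(url: str) -> str:
    """Normalize a URL and return its MD5 hex digest for the visited cache."""
    parts = urlparse(url)
    # Drop tracking parameters and sort the rest so ordering doesn't matter.
    query = sorted((k, v) for k, v in parse_qsl(parts.query)
                   if k.lower() not in TRACKING_PARAMS)
    normalized = urlunparse(parts._replace(query=urlencode(query), fragment=""))
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Both URLs now hash to the same cache key:
a = canonical_key("http://www.domain.com/path?id=1")
b = canonical_key("http://www.domain.com/path?id=1&sessionid=XXXXXX")
assert a == b
```

The problem is that this only works for parameters I already know about.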

Is there any better solution?

g.yeung
  • Remove the parameters before hashing to md5? – Thomas Jungblut May 19 '13 at 15:06
  • @ThomasJungblut - Probably, `id=2` would lead to another page. – Egor Skriptunoff May 19 '13 at 15:08
  • And you are implementing it in...? C# can offer you the last URI that the client was redirected to. – Novak May 19 '13 at 15:29
  • @Egor Skriptunoff - you are right: a request with a parameter like `id=1` or `id=2` returns different page content. But even when the `id` parameter never changes, one or more extra parameters like `sessionid=XXXXXX`, `checked=true` and so on change the URL, so the unique key (MD5) changes too. – g.yeung May 20 '13 at 02:23

1 Answer


If I were you, I'd use an algorithm to calculate the similarity of the content, and if two pages are more similar than a configured threshold, consider them the same document. Checking for absolute equality (like an MD5 sum) won't work, because dynamic content (like a timestamp) would break such a scheme.

Using document similarity is a common approach in web crawling to prevent robots from downloading essentially the same content over and over.

A very simple similarity algorithm like Levenshtein distance could do the job, but something like cosine similarity would be better suited for this.
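For illustration, a minimal sketch of such a check using cosine similarity over bag-of-words token counts (Python; the tokenizer and the 0.95 threshold are assumptions you would tune for your corpus):

```python
import math
import re
from collections import Counter

def tokens(text: str) -> Counter:
    """Bag-of-words counts; a real crawler would strip HTML tags first."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    dot = sum(ta[w] * tb[w] for w in ta.keys() & tb.keys())
    norm = (math.sqrt(sum(c * c for c in ta.values()))
            * math.sqrt(sum(c * c for c in tb.values())))
    return dot / norm if norm else 0.0

THRESHOLD = 0.95  # placeholder; tune against your own pages

def is_duplicate(page_a: str, page_b: str) -> bool:
    return cosine_similarity(page_a, page_b) >= THRESHOLD
```

Note the trade-off: Levenshtein distance on two full pages is O(n·m) in page length, which is usually too slow for a crawler, while the bag-of-words cosine is linear in the number of tokens and ignores ordering, which is what you want for boilerplate-heavy pages.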

Enno Shioji
  • thank you for your answer, it inspired me. But by the time I can compare page content (e.g. for `id=1` and `id=1&sessionid=xxx`), the robot has already downloaded the page, and the content MD5 values are exactly equal. Maybe I should add a layer that collects all URLs referring to the same document and maps them to one unique URL (I think the shortest one), so that when a new URL comes in, I can check it against that map (a sketch of this idea follows below the thread)? But parameters like `sessionid=xxx` could change endlessly. – g.yeung May 20 '13 at 02:44
  • @gao.yang: I think you want to use the actual page content (as opposed to the url) to calculate the document similarity. As in the actual html and content within. This is expensive but given that the web is so chaotic, I can't think of a way that can handle this without using the actual raw contents. – Enno Shioji May 20 '13 at 08:52
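A minimal sketch of the layer g.yeung describes above: hash the downloaded content and map every URL that produced it to one canonical URL. This is an assumption-laden Python sketch, and it still costs one download per previously unseen URL, which is exactly the limitation discussed in this thread:

```python
import hashlib

class VisitedCache:
    """Maps a content hash to the first (canonical) URL that produced it,
    so later URLs that differ only in session parameters are recognized
    as duplicates after they have been fetched once."""

    def __init__(self):
        self.content_to_url = {}   # content MD5 -> canonical URL
        self.seen_urls = set()     # exact URLs already fetched

    def should_fetch(self, url: str) -> bool:
        return url not in self.seen_urls

    def register(self, url: str, content: bytes) -> str:
        """Record a fetched page; return the canonical URL for its content."""
        self.seen_urls.add(url)
        digest = hashlib.md5(content).hexdigest()
        # The first URL seen for this content becomes the canonical one.
        return self.content_to_url.setdefault(digest, url)
```

For pages with dynamic fragments (timestamps, session tokens in the HTML itself), the exact MD5 here would have to be replaced by the similarity check from the answer.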