
How can I bring Google-like recrawling into my application (web or console)? I need to recrawl only those pages that were updated after a particular date.

The Last-Modified header in System.Net.WebResponse gives only the current date of the server. For example, if I downloaded a page with HttpWebRequest on 27 January 2012 and check the header for the Last-Modified date, it shows the current time of the server when the page was served; in this case, 27 January 2012.
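
For reference, here is a minimal sketch of how I'm reading the header (the URL is a placeholder):

```csharp
using System;
using System.Net;

class LastModifiedCheck
{
    static void Main()
    {
        // Placeholder URL; a HEAD request fetches the headers without the body.
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/page.html");
        request.Method = "HEAD";

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            // Per the documented behavior, if the server omits Last-Modified,
            // this property falls back to the current date/time, so dynamic
            // pages often look as if they were modified "just now".
            Console.WriteLine("Last-Modified: {0}", response.LastModified);
        }
    }
}
```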

Can anyone suggest any other methods?

Sunil Raj
  • Are you trying to crawl pages whose content changed, or pages whose resource on the server changed? It's an important distinction... if you're trying to detect a change in content, then Last-Modified will not give you that information, as the content is dynamically served. – Kiril Jan 27 '12 at 16:28
  • I want to schedule a crawl process at a particular interval, say every 10 days. While recrawling, I want to crawl only those pages that were modified after my previous crawl. – Sunil Raj Jan 30 '12 at 04:49
  • You didn't really answer my question... there is a difference between when a page last changed and when its content last changed. The content of the page can change without the actual server resource (i.e. the page) changing. So which is it, the page or the content of the page? – Kiril Jan 30 '12 at 14:18
  • The content (if it is possible to detect a change in it without downloading the content). – Sunil Raj Jan 31 '12 at 03:57

1 Answer


First, I want to point out that what you're trying to do is very difficult, and there is a great deal of research-level work that tries to address it (I will give you links to a few papers a little later). There is no way to tell whether a site has changed without crawling it, although there are shortcuts, such as checking the Content-Length in the response header without downloading the rest of the page (see the sketch below). That will let your system save on traffic, but it won't resolve your problem in a manner that's really useful.
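
A rough sketch of that shortcut, assuming the server answers HEAD requests and actually sends a Content-Length header (many dynamic pages don't, so treat a missing header as inconclusive):

```csharp
using System;
using System.Net;

class ContentLengthCheck
{
    // Returns true when the page *may* have changed since the previous crawl.
    static bool MayHaveChanged(string url, long previousLength)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD"; // headers only, no body transfer

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            // ContentLength is -1 when the header is absent; in that case we
            // cannot rule out a change, so err on the side of recrawling.
            long length = response.ContentLength;
            return length == -1 || length != previousLength;
        }
    }
}
```

Keep in mind that an unchanged Content-Length does not prove the content is unchanged; it only lets you skip pages whose size has obviously changed.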

Second, since you're concerned about content, the Last-Modified header field will not be very useful to you, and I would even go as far as to say that it will not be useful at all.

And third, what you're describing has somewhat conflicting requirements, because you're interested in crawling only the pages whose content has been updated, and that's not exactly how Google does things (yet you want Google-like crawling). Google's crawling focuses on providing the freshest content for the most frequently searched/visited websites. For example: Google has very little interest in frequently crawling a website that updates its content twice a day but has 10 visitors a day; it is far more interested in crawling a website that gets 10 million visitors a day, even if its content updates less frequently. Websites that update their content frequently may also tend to have many visitors, but from Google's perspective that's not directly relevant.


If you have to discover new websites (coverage) and at the same time want the latest content of the sites you already know about (freshness), then you have conflicting goals (which is true for most crawlers, even Google's). Usually, more coverage means less freshness, and more freshness means less coverage. If you're interested in balancing both, then I suggest you read the following articles:

The summary of the idea is that you have to crawl a website several times (maybe several hundred times) in order to build up a good measure of its history. Once you have a good set of historical measures, you use a predictive model to estimate when the website will change again, and you schedule a crawl for some time after the expected change (a simplistic sketch follows).
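
As a rough illustration of that scheduling loop (this is not one of the models from the literature; it naively assumes changes are roughly evenly spaced and predicts the next change from the average interval between previously observed ones):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class CrawlScheduler
{
    // changeTimes: ascending timestamps of past crawls where a change was detected.
    static DateTime NextCrawlTime(List<DateTime> changeTimes)
    {
        // Not enough history yet: fall back to an arbitrary default interval.
        if (changeTimes.Count < 2)
            return DateTime.UtcNow.AddDays(1);

        // Average the gaps between consecutive observed changes.
        double avgHours = changeTimes
            .Zip(changeTimes.Skip(1), (earlier, later) => (later - earlier).TotalHours)
            .Average();

        // Schedule the revisit shortly after the next change is expected.
        return changeTimes.Last().AddHours(avgHours * 1.1);
    }
}
```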

Kiril