7

I know this question has been answered before in this thread, but I couldn't seem to find the details.

In my scenario, I am building a console application that will watch the HTML source of a page for changes. If any update or change occurs, I will perform further operations. I'll also send a request every second, or as soon as the previous request finishes.

I can't figure out whether I should use HttpWebRequest or WebClient to download the HTML page source and perform the comparison. What do you think would be the ideal solution in my case? I need both speed and reliability :)

code master

2 Answers

15

I'd go with HttpWebRequest because it's less abstracted and lets you control the HTTP parameters in more detail. For example, it gives you the option of not downloading the entire page if the server reports that the file has not changed.

If you add a conditional header to your request, such as IfModifiedSince (on either a HEAD or a GET request), the server may return the response code 304 - NOT MODIFIED. See the description of caching in HTTP for further explanation.

The point is to make sure that you only download the full page when it's actually modified since the last time you fetched it. Most of the time it won't be changed (I suppose, can't know for sure without knowing your domain), so you only need to get a lightweight response from server which simply states "nothing changed here".

Update: code sample demonstrating the use of IfModifiedSince property:

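bool IsResourceModified(string url, DateTime dateTime) {
    try {
        var request = (HttpWebRequest)WebRequest.Create(new Uri(url));
        request.IfModifiedSince = dateTime;
        request.Method = "HEAD";
        // Dispose the response so the underlying connection is released.
        using (var response = (HttpWebResponse)request.GetResponse()) {
            // A successful response means the server considers the page modified.
            return true;
        }
    }
    catch (WebException ex) {
        // Rethrow connection-level failures (DNS, timeout, and so on).
        if (ex.Status != WebExceptionStatus.ProtocolError)
            throw;

        // 304 Not Modified arrives as a ProtocolError; any other status is a real error.
        using (var response = (HttpWebResponse)ex.Response) {
            if (response.StatusCode != HttpStatusCode.NotModified)
                throw;

            return false;
        }
    }
}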

This method should return true if the page was modified since dateTime and false if it wasn't. GetResponse throws a WebException when the server returns 304 - NOT MODIFIED (which is somewhat unfortunate), so the 304 case has to be handled in the catch block. To make sure the exception wasn't caused by some other connection problem, I check both the status of the WebException and the HTTP status code of the response. If anything else caused the exception, it is simply rethrown.

Console.WriteLine(IsResourceModified("http://example.com", new DateTime(2009, 1, 1)));
Console.WriteLine(IsResourceModified("http://example.com", DateTime.Now));

This sample code produces the output:

True
False

Note: make sure to read Jim Mischel's addition to this answer, as he gives some good advice on this technique.

Dyppl
  • Can you please post an example for "file not changed" scenario? – code master Jun 04 '11 at 21:31
  • @Free Styler - There is an HTTP status code - [304 Not Modified](http://www.checkupdown.com/status/E304.html). – Oded Jun 04 '11 at 21:33
  • @Free Styler: I added it to my answer – Dyppl Jun 04 '11 at 21:38
  • I agree with the idea. The page will not change for most of the time. What do you think I should set the minimum threshold time between two requests? – code master Jun 04 '11 at 21:49
  • @Free Styler: it depends entirely on the content of the page and how often it gets updated and how relevant you want your data, really, so it's hard for me to guess. Maybe to check for new version (without downloading data unless necessary) every minute will suffice. – Dyppl Jun 04 '11 at 21:58
  • @Dyppl I tried that, but I'm only getting the status code "OK", not "Not Modified". Here is my source code: HttpWebRequest request = HttpWebRequest.Create(new Uri(url)) as HttpWebRequest; request.Method = "GET"; HttpWebResponse response = (HttpWebResponse) request.GetResponse(); – code master Jun 04 '11 at 22:03
  • @Free Styler: I suppose you have to add some extra info in your request. Take a look at `IfModifiedSince` property: http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.ifmodifiedsince.aspx – Dyppl Jun 04 '11 at 22:11
  • thanks for the hint, don't you think we also need to set the IfModifiedSince property before assigning it to the request? – code master Jun 04 '11 at 22:19
  • @Free Styler: yes, that's what I meant, you should set the value of `IfModifiedSince` property of your request object to the last time you fetched the page and only then call `GetResponse` – Dyppl Jun 04 '11 at 22:21
  • @Dyppl I still believe, I haven't understood this property quite well. Would you mind posting some sample code. I would appreciate with big heart :)) – code master Jun 05 '11 at 00:00
  • @Free Styler: I added it to my answer – Dyppl Jun 05 '11 at 07:52
  • I tried that, but the interesting thing is that it always returns true no matter what time you pass. Here is the URL that I'm passing in the call: http://www.wallenstam.se/boende/Lediga-bostader/Lediga-bostader1/Helsingborg/ – code master Jun 06 '11 at 14:11
  • @Free Styler: it's possible for web-server to have this feature turned off for some pages or for an entire site. In that case I suppose you'll have to download the whole thing to check – Dyppl Jun 06 '11 at 14:16
  • In this case, what way would you suggest me to check the html source for any updates/changes? – code master Jun 06 '11 at 14:17
  • @Free Styler: don't know, you'd have to save some pages and see what differs and what doesn't – Dyppl Jun 06 '11 at 15:28
9

I was going to leave this as a comment to @Dyppl's response, but it became too long.

Dyppl's response is generally good advice, and the way that I would approach this problem. However, there are a few things you should keep in mind.

First, there's no reason to do a HEAD request, followed by a GET if the page has been modified. You can do a GET with the IfModifiedSince header set, and the server will either return the entire page or a 304. Doing the HEAD first, followed by the GET, ends up making two requests to the server, which defeats much of the purpose of the conditional request.

Second, you should set the IfModifiedSince property to the LastModified value returned by the previous response (i.e. HttpWebResponse.LastModified) because the server's time might not be synchronized with your computer. Also, I've found that a large percentage of sites, particularly those with generated content (like WordPress blogs) lie. They always return the current date/time in the LastModified header. As a result, there is no benefit to doing the If-Modified-Since check on those sites.
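The two points above (a single conditional GET, and reusing the server's own timestamp rather than the local clock) can be combined roughly as follows. This is only a minimal sketch under stated assumptions: the `ConditionalFetch` class and the `FetchIfModified` method are hypothetical names made up for illustration, and it uses the classic `HttpWebRequest` API discussed in this thread. It will not help on sites that always report the current time as Last-Modified.

```csharp
using System;
using System.IO;
using System.Net;

static class ConditionalFetch
{
    // Hypothetical helper: returns the page body when the server sends a fresh copy,
    // or null when it answers 304 Not Modified.
    // lastModified holds the server-reported timestamp from the previous fetch
    // (null on the first call) and is updated whenever a new copy arrives.
    public static string FetchIfModified(string url, ref DateTime? lastModified)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        if (lastModified.HasValue)
            request.IfModifiedSince = lastModified.Value;

        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                // Only trust the timestamp if the server actually sent the header.
                if (response.Headers["Last-Modified"] != null)
                    lastModified = response.LastModified;
                return reader.ReadToEnd();
            }
        }
        catch (WebException ex)
        {
            var response = ex.Response as HttpWebResponse;
            if (ex.Status == WebExceptionStatus.ProtocolError
                && response != null
                && response.StatusCode == HttpStatusCode.NotModified)
            {
                response.Close();
                return null; // nothing changed since lastModified
            }
            throw; // any other failure is a real error
        }
    }
}
```

A caller would keep one `DateTime?` per URL, pass it by `ref` on every poll, and treat a `null` return value as "no change".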

If you know that the site lies and always returns the current date/time, you can keep track of the ContentLength header that's returned from the page when you download it. Then, when you want to check to see if the page has changed, do a HEAD request and check the returned ContentLength header with the saved value. If they match, then it's unlikely that the page has changed. If they don't match, then do a GET request to update your copy of the page and keep the new ContentLength.

This technique does have the disadvantage of requiring two requests if the page has changed. It's also not 100% reliable on all servers. Some will return a different ContentLength for the HEAD request, and some don't return a valid ContentLength at all. That said, I've found it to be effective for a large number of sites.
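The Content-Length comparison described above might look something like this. Again, this is a sketch rather than a definitive implementation: `LengthCheck`, `LooksChanged` and `GetHeadContentLength` are hypothetical names, and treating a missing Content-Length (which `HttpWebResponse.ContentLength` reports as -1) as "changed" is an assumption on my part, chosen so that unreliable servers fall back to a full GET.

```csharp
using System;
using System.Net;

static class LengthCheck
{
    // Pure comparison, kept separate from the network call.
    // A negative length means the server sent no usable Content-Length,
    // so the page is assumed to have changed (assumption, see lead-in).
    public static bool LooksChanged(long savedLength, long headLength)
    {
        if (savedLength < 0 || headLength < 0)
            return true;
        return savedLength != headLength;
    }

    // Issues a HEAD request and returns the reported Content-Length (-1 if absent).
    public static long GetHeadContentLength(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            return response.ContentLength;
        }
    }
}
```

Typical use would be to call `GetHeadContentLength(url)`, pass the result to `LooksChanged` together with the length saved from the last download, and only issue the GET (and store the new length) when it returns true.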

Jim Mischel
  • Hi Jim, thanks for the reply. My question is: why shouldn't I make a single GET request and check the content length for changes, instead of making two requests? – code master Jun 06 '11 at 15:54
  • @Free Styler: Because when you make the GET request, the server will start streaming the data to you. Even if you don't actually read the data, the server will be sending it and it will arrive at your machine. You end up downloading a lot of data that you don't need. That's not a problem if you're doing a few pages once in a while, but if you're talking about checking 30 pages per second, it's something to think about. – Jim Mischel Jun 06 '11 at 23:18
  • Thanks, this is good advice; I added a link to your post in my answer. I don't think that using `HEAD` is much of a problem though, as you only make 2 requests when something *has* changed, which should be a relatively rare event (although this is completely domain-specific). – Dyppl Aug 21 '12 at 05:06