I'm trying to write a little utility that periodically checks a web page (it could be any URL) and tells me if/when its content has changed. I've read the other postings, but as far as I can tell they don't really answer my question.
I know that for static pages there is the Last-Modified header. But what about dynamic pages? I got Oli's comment that storing a hash of the contents works, but that's not really ideal because the page might simply have a timestamp on it (the date and time the page was generated). In that case the content would be different on every single request even though nothing significant has changed.
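To illustrate the problem, here is a minimal sketch. The two page strings are made up, and `content_hash` is just a name I picked for the example:

```python
import hashlib


def content_hash(html: str) -> str:
    """Return the SHA-256 hex digest of the page body."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


# Two hypothetical responses that differ only in the generated-at timestamp.
page_a = "<p>Hello world</p><footer>Generated 2024-01-01 10:00:00</footer>"
page_b = "<p>Hello world</p><footer>Generated 2024-01-01 10:05:00</footer>"

# The hashes differ even though the meaningful content is identical.
hashes_differ = content_hash(page_a) != content_hash(page_b)
print(hashes_differ)
```

So a plain hash comparison would report a change on every request for a page like this.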
So now I'm thinking of tying it to a percentage of 'changedness': for example, only when more than 5% of the content has changed would the 'changed' logic run.
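Roughly what I have in mind, as a sketch only. It uses Python's `difflib.SequenceMatcher` as one possible similarity measure; the function names, the sample pages, and the 5% threshold are placeholders rather than a tested design:

```python
import difflib


def changed_fraction(old: str, new: str) -> float:
    """Return how much the text differs: 0.0 means identical, 1.0 means nothing in common."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return 1.0 - matcher.ratio()


def has_meaningfully_changed(old: str, new: str, threshold: float = 0.05) -> bool:
    """Treat the page as changed only if more than `threshold` of it differs."""
    return changed_fraction(old, new) > threshold


# Hypothetical snapshots of the same URL.
footer_1 = "<footer>Generated 2024-01-01 10:00:00</footer>"
footer_2 = "<footer>Generated 2024-01-01 10:05:00</footer>"
old_page = "<p>Hello world</p>" + footer_1
timestamp_only = "<p>Hello world</p>" + footer_2
rewritten = "<p>An entirely different article about something else</p>" + footer_2

print(has_meaningfully_changed(old_page, timestamp_only))  # timestamp-only difference
print(has_meaningfully_changed(old_page, rewritten))       # body text replaced
```

One obvious weakness is that the right threshold depends on page size: on a very short page a timestamp alone could exceed 5%, and on a very long page a significant edit might not.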
I'd love to hear any ideas on how I can reliably tell when a page has changed, in a meaningful way.