Once again I come to you for your expertise and advice. Does anyone know how to detect whether a web page has been modified, using VB.NET? I need to set up a task that periodically (say, once a week) scans a list of user-supplied web pages and, if a page's content has changed, sends an email to an individual saying that it changed (not where on the page it changed). I'll be storing the HTTP status, the page data itself, and the date it was last modified. This needs to be very fault tolerant, since it could be another week before the check runs again. Any help would be great. Thank you.

EDIT

New twist on this question, sorry. I had more time to think about what we wanted. Detecting ANY change on a web page would be kind of silly, since time-dependent elements of the page change every so often. Instead, what I would like to do is detect the documents in the page, for instance Excel files, Word docs, or PDFs, and notice when they change. So I'd run a hash on these documents, then on some sort of schedule check whether new documents have been added or old documents have been modified. Any suggestions on how to detect the documents embedded on the page and run the hash? Thanks again!

New Guy
  • What kind of pages will you have, .aspx or .html? And where will you compare whether the pages have changed? – Jalpesh Vadgama Jul 19 '13 at 12:19
  • It could be .asp/.aspx or .html. As for where to compare whether a page has changed, the stored copy should be kept as binary; it could be HTML, a PDF, a Word doc, etc. I'm not sure how to do the comparison, though. – New Guy Jul 19 '13 at 12:53
  • This is almost exactly what [checksums](http://en.wikipedia.org/wiki/Checksum) were designed for. – RoadieRich Jul 19 '13 at 12:54

2 Answers

As I mentioned in a comment, this sort of job is what checksums (also known as hash functions) were designed for.

Your code will look something like this:

- for each webpage of interest
  - pull web page
  - calculate checksum of contents
  - is current checksum different to last checksum?
    - if yes, send email
  - store new checksum and other appropriate data
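
The steps above can be sketched roughly as follows. This is only an illustration in Python, not VB.NET, and every name in it (`fetch`, `checksum`, `check_pages`, the `last_checksums` dictionary) is made up for the example. It assumes the previous checksums are kept in a dictionary that the caller persists between runs, and it uses `print` as a stand-in for sending the email. A failed download leaves the stored checksum alone so the page is simply retried on the next run.

```python
import hashlib
import urllib.request


def fetch(url, timeout=30):
    """Download the raw bytes of a page (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def checksum(content):
    """Hex digest of the page bytes, used only to spot a change."""
    return hashlib.md5(content).hexdigest()


def check_pages(urls, last_checksums, fetch_page=fetch, notify=print):
    """Run one pass over the pages and return the list of changed URLs.

    last_checksums maps url -> checksum from the previous run and is
    updated in place; saving it between runs is left to the caller.
    """
    changed = []
    for url in urls:
        try:
            current = checksum(fetch_page(url))
        except OSError as error:
            # Network failure: keep the old checksum and retry next run.
            notify(f"could not fetch {url}: {error}")
            continue
        previous = last_checksums.get(url)
        if previous is not None and previous != current:
            changed.append(url)
            notify(f"{url} has changed")  # stand-in for sending the email
        last_checksums[url] = current
    return changed
```

Passing the download and notification steps in as parameters keeps the loop itself easy to test without touching the network or a mail server.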

The .NET Framework has a number of checksum algorithms available. The two most popular are MD5 and SHA-1.
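
To give a feel for what these two produce, here is a small sketch using Python's `hashlib` rather than the .NET classes (in .NET the corresponding classes live in `System.Security.Cryptography`). The `digests` helper and the sample HTML are invented for the example. The point is just that any change to the bytes, however small, yields a different digest.

```python
import hashlib


def digests(content):
    """Return the MD5 and SHA-1 hex digests of some bytes (hypothetical helper)."""
    return {
        "md5": hashlib.md5(content).hexdigest(),    # 128 bits, 32 hex characters
        "sha1": hashlib.sha1(content).hexdigest(),  # 160 bits, 40 hex characters
    }


# Made-up page contents that differ by a single character.
before = digests(b"<html><body>Report v1</body></html>")
after = digests(b"<html><body>Report v2</body></html>")

print(before["md5"] != after["md5"])    # the digests differ
print(before["sha1"] != after["sha1"])  # the digests differ
```

Either algorithm is fine for noticing that something changed. Neither should be relied on where an attacker might deliberately craft two documents with the same digest.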

RoadieRich
  • yea, checksum sounds right. My only problem is actually getting the content of the page itself. That's my biggest issue. – New Guy Jul 19 '13 at 13:18
  • @NewGuy See [this question](http://stackoverflow.com/questions/929808/how-do-i-download-a-webpage-into-a-stream-in-net) – RoadieRich Jul 19 '13 at 13:25
  • So, some things changed now that I got a better idea of what we are looking for. Please read the edit above. Thank you! – New Guy Jul 29 '13 at 13:25
  • @NewGuy You'd be better off creating a new question, linking back to this one. – RoadieRich Jul 29 '13 at 13:34

In addition to the checksum option, there are also various diff functions that achieve this and provide much more information than changed = true/false. This question has more info:

How to tell when a web page has changed by x% in VB.net?
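
As a rough idea of what a diff-based check can report, here is a minimal sketch using Python's standard `difflib` module. It is not taken from the linked question, and `percent_changed` is a name invented for the example. It compares the old and new page text line by line and turns the similarity ratio into a percentage.

```python
import difflib


def percent_changed(old_text, new_text):
    """Estimate how much of a page changed, as a percentage (hypothetical helper).

    Compares the two versions line by line; 0.0 means identical and
    100.0 means no lines in common.
    """
    matcher = difflib.SequenceMatcher(
        None, old_text.splitlines(), new_text.splitlines()
    )
    return (1.0 - matcher.ratio()) * 100.0
```

A threshold on this value (say, only email when more than a few percent changed) is one way to ignore small time-dependent elements, though the right cutoff depends on the pages involved.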

Travis
  • Thank you for this alternative. I will keep this in mind if we want the specifics of a web page. – New Guy Jul 19 '13 at 13:32