4

I'm trying to write a little utility which will check periodically and tell me if/when a web page's (could be any URL) content has changed. I've read the other postings but they don't really answer my question (as far as I can tell).

I know that static pages have a Last-Modified header. However, what about dynamic pages? I got Oli's comment that storing a hash of the contents works, but that's not really ideal because the page might simply have a timestamp on it (the date and time at which the page was produced). Clearly, in this case, the content would be different on every single request even though nothing significant has changed.
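To make the hashing problem concrete, here is a minimal Python sketch. The two page snippets and the `content_hash` helper are made up for illustration; the only difference between the snapshots is a generated timestamp, yet the hashes no longer match.

```python
import hashlib


def content_hash(html: str) -> str:
    """Return the SHA-256 hex digest of the page source."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


# Two hypothetical snapshots that differ only in a generated timestamp.
old_page = "<html><body><p>Hello</p><span>Generated 2010-01-01 10:00:00</span></body></html>"
new_page = "<html><body><p>Hello</p><span>Generated 2010-01-01 10:05:00</span></body></html>"

# A plain hash comparison flags this as a change even though nothing
# meaningful is different.
hashes_match = content_hash(old_page) == content_hash(new_page)
print(hashes_match)  # False
```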

So now I'm thinking of tying it to a percentage of 'changedness': something like, more than 5% changed will cause the 'changed' logic to run.
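One possible way to express that threshold, sketched in Python with the standard `difflib` module. The names `percent_changed` and `has_changed`, the word-level comparison, and the 5% default are assumptions for illustration, not a recommendation of a particular metric.

```python
import difflib


def percent_changed(old: str, new: str) -> float:
    """Estimate how much of the text changed, as a percentage (0 to 100).

    Compares the two texts word by word; SequenceMatcher.ratio() is 1.0 for
    identical sequences and 0.0 when nothing matches.
    """
    matcher = difflib.SequenceMatcher(None, old.split(), new.split(), autojunk=False)
    return (1.0 - matcher.ratio()) * 100.0


def has_changed(old: str, new: str, threshold: float = 5.0) -> bool:
    """Run the 'changed' logic only when the difference exceeds the threshold."""
    return percent_changed(old, new) > threshold
```

With this kind of measure, a single changed timestamp on a long page stays well under the threshold, while a rewritten page goes over it. The right threshold would still have to be found by looking at real pages.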

I'd love to hear any ideas on how I can reliably tell when a page has changed, in a meaningful way.

John

3 Answers

3

One solution is to identify the parts of a dynamic page that are effectively static and that you would consider 'changed' if they were updated. Use a diff tool (example below) to compare the original page source with the updated page source. However, identifying these parts manually for every page would not scale well if you have more than a few dozen pages.

Two ideas:

1) Use HtmlAgilityPack (a .NET library) to parse the page DOM and count the distinct page elements in both the stored, previously scanned page and the newly scanned version of it. Use a formula that you deem satisfactory to flag a 'change'. A very simple example: the old copy has 8 anchor (<a>) tags and the new one has only 5.

2) Use a diffing library such as DiffPlex (http://diffplex.codeplex.com/) to determine word and line changes. You will need to work out, through analysis, a baseline number of word and line changes above which a difference counts as a valid 'change'.

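        // Needs the DiffPlex package and these namespaces:
        // using DiffPlex; using DiffPlex.DiffBuilder; using DiffPlex.DiffBuilder.Model;
        // OldText and NewText hold the stored and the freshly fetched page source.
        var d = new Differ();
        var inlineBuilder = new InlineDiffBuilder(d);
        var result = inlineBuilder.BuildDiffModel(OldText, NewText);
        // Initialize every counter; C# does not allow reading unassigned locals.
        int inserted = 0, deleted = 0, modified = 0;
        foreach (var line in result.Lines)
        {
            if (line.Type == ChangeType.Inserted)
                inserted++;
            else if (line.Type == ChangeType.Deleted)
                deleted++;
            else if (line.Type == ChangeType.Modified)
                modified++;
        }
        // some base line formula/threshold you come up with through analysis
        bool changed = deleted + inserted + modified > 10;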
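Idea 1 (comparing element counts) can be sketched in a similar way. The following is a rough Python illustration that swaps in the standard library's `html.parser` for HtmlAgilityPack; the `TagCounter` class and the `count_tags` and `tag_count_delta` helpers are hypothetical names, and what counts as a significant delta is left to your own analysis.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times each start tag appears in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html: str) -> Counter:
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def tag_count_delta(old_html: str, new_html: str) -> dict:
    """Return {tag: new_count - old_count} for every tag whose count differs."""
    old_counts, new_counts = count_tags(old_html), count_tags(new_html)
    return {
        tag: new_counts[tag] - old_counts[tag]
        for tag in set(old_counts) | set(new_counts)
        if new_counts[tag] != old_counts[tag]
    }


# The example from idea 1: 8 anchors before, 5 after.
old_html = "<p>" + "<a href='#'>x</a>" * 8 + "</p>"
new_html = "<p>" + "<a href='#'>x</a>" * 5 + "</p>"
print(tag_count_delta(old_html, new_html))  # {'a': -3}
```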
wp78de
jdmonty
0

You won't need to write your own code to do this. There are many, many examples of different implementations of diff. Diff will tell you way more than what you need (it tells you what specifically has changed), but it should solve your problem.
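As one example of an existing diff implementation, Python ships `difflib` in its standard library. The sketch below is only an illustration (the `changed_lines` helper is a made-up name); it keeps the added and removed lines from `difflib.ndiff` and drops everything else.

```python
import difflib


def changed_lines(old: str, new: str) -> list:
    """Return the lines that were removed ('- ') or added ('+ ')."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    # ndiff prefixes every line with a two-character code:
    # '- ' removed, '+ ' added, '  ' unchanged, '? ' intraline hint.
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

The length of the returned list could then feed whatever threshold you settle on.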

colithium
0

You might want to consider using the Levenshtein distance to measure the difference between the new version of the page and the version you have stored.

http://en.wikipedia.org/wiki/Levenshtein_distance
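For reference, a minimal Python sketch of the standard dynamic-programming computation of the Levenshtein distance. The function name is an assumption, and because the running time grows with the product of the two lengths, it may be too slow on whole pages unless you compare words or lines rather than characters.

```python
def levenshtein(a, b) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Since the function only needs `len()` and iteration, it also accepts lists, so `levenshtein(old.split(), new.split())` gives a word-level distance.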

BoltBait