Suppose I have two HTML sources. I want to compare them and, if they differ by more than a given percentage, do something with the new HTML.
For example, if the two HTML pages differ by 5% or more, I want to e-mail somebody.
How can I do this in Java? Is there a library for this?

- What kind of difference are you talking about? – Çağdaş May 11 '11 at 14:45
- http://stackoverflow.com/questions/141993/best-way-to-compare-2-xml-documents-in-java might answer your question – VirtualTroll May 11 '11 at 14:46
- The linked question doesn't answer this one, sorry. I'm talking about any kind of difference in the HTML source. For example, if two web pages differ only in the date/time at the top of the page, it should return something like a 0.1% difference. – Alireza Noori May 11 '11 at 17:22
2 Answers
Our Smart Differencer tool might be helpful here.
This tool compares the structure of "code" (various languages, HTML being one) and produces a "diff" like output but it is focused on code differences rather than just raw text differences, using language-specific (but somewhat limited) knowledge about what is really different. So, if you swapped the placement of two attributes in a tag, it would say there was no difference.
The diff output tells you which code blocks have been deleted, inserted, moved, or copied, complete with substitutions detectable according to language structure. (For HTML, any change in normally displayed text is considered a replacement; it doesn't diff such text strings.) You'd have to scan the tool output to collect your "overall change" statistics, so this wouldn't be conceptually different from doing the same thing with cygwin diff, but the numbers would likely be more precise. YMMV.

- Thanks for the answer. I think I should use more sophisticated tools for my project now. (Tools like block-based difference, etc.) I'll look more into your project too, but anyway I select this as the answer since this is better than the other answer. Thanks again. – Alireza Noori Jul 18 '12 at 18:46
The cheap and nasty way to do this is to run everything through an HTML tidier, remove insignificant whitespace, then insert line-breaks before every '<' character. You can run the resulting text through a standard line-based diff utility to give you a rough difference metric which is "good enough", in my experience.
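As a rough illustration of the approach above, here is a minimal Java sketch using only the standard library. It assumes the input has already been run through an HTML tidier (that step is not shown), and it stands in a simple longest-common-subsequence count for a real line-based diff utility. The class and method names (`HtmlDiffPercent`, `toLines`, `lcsLength`, `percentChanged`) are made up for this example, and "percentage" here means deleted plus inserted lines divided by the line count of the original, which is only one possible definition.

```java
import java.util.ArrayList;
import java.util.List;

public class HtmlDiffPercent {

    /** Collapse runs of whitespace, then start a new line before every '<'. */
    static List<String> toLines(String html) {
        String normalized = html.replaceAll("\\s+", " ").replace("<", "\n<");
        List<String> lines = new ArrayList<>();
        for (String line : normalized.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    /** Length of the longest common subsequence of two line lists (O(n*m) time and memory). */
    static int lcsLength(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.size()][b.size()];
    }

    /** Deleted plus inserted lines, as a percentage of the original's line count. */
    static double percentChanged(String oldHtml, String newHtml) {
        List<String> a = toLines(oldHtml);
        List<String> b = toLines(newHtml);
        if (a.isEmpty()) {
            return b.isEmpty() ? 0.0 : 100.0;
        }
        int common = lcsLength(a, b);
        int changed = (a.size() - common) + (b.size() - common);
        return 100.0 * changed / a.size();
    }

    public static void main(String[] args) {
        String oldHtml = "<html><body><p>Hello</p><p>World</p></body></html>";
        String newHtml = "<html><body><p>Hello</p><p>There</p></body></html>";
        double pct = percentChanged(oldHtml, newHtml);
        System.out.println(pct + "% changed");
        if (pct >= 5.0) {
            // Assumption: the 5% threshold from the question; hook the e-mail call in here.
            System.out.println("Threshold exceeded");
        }
    }
}
```

Because inserted lines are counted too, the result can exceed 100% when the new page is much larger than the old one. The quadratic table is fine for ordinary pages but would need replacing for very large documents.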

- I don't want to do something like this. There are a few tools like [DaisyDiff](http://code.google.com/p/daisydiff) that represent the diff as HTML. The main implementation is HTMLDiff, and there are lots of web tracking tools that do something like what I want, but I'm looking for a library that, instead of representing the difference in another HTML document, just tells me how much change it detected. – Alireza Noori May 11 '11 at 17:29
- Given what I described, "how much changed" is trivial: it's the line count of the diff divided by the line count of the original. – regularfry May 12 '11 at 10:54
- The problem with your solution is that XML and HTML documents are tree-based documents, not line-based. So we have to compare nodes (not lines). – Alireza Noori May 12 '11 at 20:10
- That's the whole point. By splitting on a '<' character, you know that any lines you detect with a change don't cross node boundaries. Like I say, it's "good enough" to give a rough metric. You'll have to define what you mean by "percentage difference" if you want a better suggestion. – regularfry May 16 '11 at 08:50
- Like you said, that's "good enough", but I didn't use it. I used [DaisyDiff](http://code.google.com/p/daisydiff), took its XML output, and converted it to a percentage. Anyway, I'm going to set your answer as the answer. – Alireza Noori May 16 '11 at 22:29