Compare two HTML documents using Jsoup (Java)

Question

I would like to compare two HTML documents, each given as a String, using Jsoup, disregarding any differences in whitespace.

Simplified example:

@Test
public void testCompare() {
  Document doc1 = Jsoup.parse("<html><body><div>Hello</div>\n</body></html>");
  Document doc2 = Jsoup.parse("<html><body><div>Hello</div>\n</body>\n</html>");

  System.out.println("Document 1");
  System.out.println("----------");
  for (Node node : doc1.body().childNodes()) {
    printNode(node);
  }

  System.out.println();

  System.out.println("Document 2");
  System.out.println("----------");
  for (Node node : doc2.body().childNodes()) {
    printNode(node);
  }

  assertTrue("HTML documents are different", doc1.hasSameValue(doc2));
}

private void printNode(Node node) {
  String text = node.getClass().getSimpleName();
  if (node instanceof TextNode) {
    TextNode textNode = (TextNode) node;
    text += ": '" + textNode.getWholeText().replaceAll("\n", "\\\\n") + "'";
  }
  System.out.println(text);
}

The only difference between the two documents is the additional newline in the second document after the closing body tag.

The resulting child nodes of the body element differ. The first document has an element node and a text node (containing a newline). The second document contains the same two nodes, plus an additional text node containing another newline. This additional text node might be a result of document normalization (text nodes outside the body tag are moved into the body, see the Javadoc of Document#normalise). Node#hasSameValue compares outerHtml, which handles consecutive whitespace within a single text node, but not across two successive text nodes.

How can I achieve this?

The solution does not have to use Jsoup if there are better alternatives that reach the same goal.

HTML or not, you have 2 strings. Maybe you can use this: https://stackoverflow.com/questions/18344721/extract-the-difference-between-two-strings-in-java — canillas, Nov 15 '17 at 15:13

Volodymyr Masliy · Answer 1 · 2018-10-27T12:12:57.453

If you treat both HTML documents as strings, you could do something like this:

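// Strip whitespace differences from the serialized HTML before comparing.
Function<String, String> normalizer = (original) ->
    original
        .replaceAll("\\s*\\n+\\s*", "") // remove line breaks and the whitespace around them
        .replaceAll(">\\s+<", "><")      // remove whitespace between tags
        .toLowerCase();                  // note: this also lowercases text content
String html1 = normalizer.apply(doc1.html());
String html2 = normalizer.apply(doc2.html());
// The message is shown when the assertion fails, so it describes the failure.
Assert.assertEquals("HTML documents are different", html1, html2);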

Keep in mind, though, that this test only checks for an exact match of the normalized strings. If tags, attributes or other data aren't in the same order in both documents, it will fail.
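To make that limitation concrete, here is a minimal, standard-library-only sketch. The class name StringNormalizerDemo and the sample strings are made up for illustration, and the regexes are a slightly tidied variant of the ones in this answer, not a definitive implementation. It shows that whitespace-only differences are ignored, while a different attribute order still counts as a mismatch.

```java
import java.util.function.Function;

// Hypothetical demo class; the name and the sample inputs are illustrative only.
class StringNormalizerDemo {

    // Remove line breaks (with surrounding whitespace) and whitespace between tags,
    // then lowercase everything. Lowercasing also affects text content.
    static final Function<String, String> NORMALIZER = original ->
        original
            .replaceAll("\\s*\\n+\\s*", "")
            .replaceAll(">\\s+<", "><")
            .toLowerCase();

    // True if both strings are identical after normalization.
    static boolean sameAfterNormalizing(String first, String second) {
        return NORMALIZER.apply(first).equals(NORMALIZER.apply(second));
    }

    public static void main(String[] args) {
        // Differs only in whitespace between tags.
        String a = "<div id=\"x\">Hello</div>\n\n<p>Hi</p>";
        String b = "<div id=\"x\">Hello</div> <p>Hi</p>";
        System.out.println("whitespace only: " + sameAfterNormalizing(a, b));

        // Same element, but the attributes are written in a different order.
        String c = "<div id=\"x\" class=\"y\">Hello</div>";
        String d = "<div class=\"y\" id=\"x\">Hello</div>";
        System.out.println("attribute order: " + sameAfterNormalizing(c, d));
    }
}
```

Because the comparison happens on plain strings, anything that changes the serialized form (attribute order, quoting style, and so on) has to be normalized separately, or handled on the parsed nodes instead.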

edited Oct 27 '18 at 12:12

answered Oct 27 '18 at 10:58


Thanks for your input. `Node#hasSameValue` is also sensitive to the order of attributes (because it only compares the `outerHtml` of both nodes), so either some form of normalization or a custom way of comparing the nodes with each other (without `hasSameValue`) is required anyway. My current approach is to apply normalization at the node level (and not on the resulting HTML): remove all comment nodes and all text nodes that contain only whitespace (`TextNode#isBlank`), order attributes, remove certain empty attributes (class, style), order CSS classes within the class attribute, and some more. – Stephan Merkli Nov 05 '18 at 14:25
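The node-level approach described in that comment could be sketched roughly as follows. This is not the commenter's actual code: the class name HtmlNormalizer and the method normalize are hypothetical, Jsoup is assumed to be on the classpath, and only three of the listed steps are covered (dropping comments, dropping blank text nodes, sorting attributes by name).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper; the name and the scope of the normalization are assumptions.
class HtmlNormalizer {

    static Document normalize(String html) {
        Document doc = Jsoup.parse(html);
        // getAllElements() returns a snapshot, so the tree can be modified while looping.
        for (Element element : doc.getAllElements()) {
            // Copy the child list before removing nodes from it.
            for (Node child : new ArrayList<>(element.childNodes())) {
                boolean blankText = child instanceof TextNode && ((TextNode) child).isBlank();
                if (child instanceof Comment || blankText) {
                    child.remove();
                }
            }
            // asList() is unmodifiable, so sort a copy and re-add the attributes in that order.
            List<Attribute> sorted = new ArrayList<>(element.attributes().asList());
            sorted.sort(Comparator.comparing(Attribute::getKey));
            for (Attribute attribute : sorted) {
                element.removeAttr(attribute.getKey());
            }
            for (Attribute attribute : sorted) {
                element.attr(attribute.getKey(), attribute.getValue());
            }
        }
        return doc;
    }
}
```

With this helper, the test from the question could compare HtmlNormalizer.normalize(html1).hasSameValue(HtmlNormalizer.normalize(html2)). Note that removing every blank text node also removes whitespace that matters for rendering, for example the space between two inline elements or whitespace inside pre, so this sketch only fits cases where such differences are meant to be ignored.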

score 0 · Answer 2 · answered Nov 06 '19 at 06:29

I had a similar requirement. I achieved it in the following ways:

You can create a shell script that uses the vimdiff command to compare the two files and export the side-by-side comparison as an HTML file.
You can use Python's difflib to get the differences between the two HTML files.