I would like to compare two HTML documents, each given as a String, using Jsoup, disregarding any differences in whitespace.
Simplified example:
    @Test
    public void testCompare() {
        Document doc1 = Jsoup.parse("<html><body><div>Hello</div>\n</body></html>");
        Document doc2 = Jsoup.parse("<html><body><div>Hello</div>\n</body>\n</html>");

        System.out.println("Document 1");
        System.out.println("----------");
        for (Node node : doc1.body().childNodes()) {
            printNode(node);
        }
        System.out.println();

        System.out.println("Document 2");
        System.out.println("----------");
        for (Node node : doc2.body().childNodes()) {
            printNode(node);
        }

        assertTrue("HTML documents are different", doc1.hasSameValue(doc2));
    }

    private void printNode(Node node) {
        String text = node.getClass().getSimpleName();
        if (node instanceof TextNode) {
            TextNode textNode = (TextNode) node;
            text += ": '" + textNode.getWholeText().replaceAll("\n", "\\\\n") + "'";
        }
        System.out.println(text);
    }
The only difference between the two documents is the additional newline after the closing body tag in the second document.
The resulting child nodes of the body element differ. In the first document, the body has an element node and a text node (containing a newline). In the second document, it has the same two nodes plus an additional text node containing another newline. This additional text node might be a result of document normalization (text nodes outside the body tag are moved into the body; see the Javadoc of Document#normalise). Node#hasSameValue compares the outerHtml output, which normalizes consecutive whitespace within a single text node, but not across two successive text nodes.
How can I compare the two documents so that such whitespace-only differences are ignored?
The solution does not have to use Jsoup if there are better alternatives for reaching the same goal.
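
To make the desired behaviour concrete, here is a minimal sketch of a whitespace-insensitive comparison that uses only the JDK's DOM parser instead of Jsoup. It rests on a strong assumption: the markup must be well-formed XML, which holds for the simplified example above but not for HTML in general. The class name HtmlWhitespaceCompare and its method names are hypothetical and chosen only for illustration. The sketch drops whitespace-only text nodes and collapses whitespace runs in the remaining ones before comparing the trees, so it is not a drop-in replacement for Node#hasSameValue.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper; assumes the input is well-formed XML (XHTML-like markup).
final class HtmlWhitespaceCompare {

    // Parses the markup into a DOM tree; malformed input is reported as IllegalArgumentException.
    static Document parse(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup is not well-formed XML", e);
        }
    }

    // Removes whitespace-only text nodes and collapses whitespace runs in the others.
    // Note: this also discards whitespace between inline elements, which can matter when rendering.
    static void stripWhitespace(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember the sibling before a possible removal
            if (child.getNodeType() == Node.TEXT_NODE) {
                String collapsed = child.getNodeValue().trim().replaceAll("\\s+", " ");
                if (collapsed.isEmpty()) {
                    node.removeChild(child);
                } else {
                    child.setNodeValue(collapsed);
                }
            } else {
                stripWhitespace(child);
            }
            child = next;
        }
    }

    // Returns true if both documents have the same structure once whitespace is normalized.
    static boolean sameIgnoringWhitespace(String first, String second) {
        Document doc1 = parse(first);
        Document doc2 = parse(second);
        doc1.normalize(); // merge adjacent text nodes first
        doc2.normalize();
        stripWhitespace(doc1);
        stripWhitespace(doc2);
        return doc1.getDocumentElement().isEqualNode(doc2.getDocumentElement());
    }
}
```

With the two strings from the test above, HtmlWhitespaceCompare.sameIgnoringWhitespace(...) returns true, because the trailing newlines end up as whitespace-only text nodes that are removed from both trees before the comparison.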