There are a number of Java libraries for handling HTML input that isn't well-formed (according to XML). These libraries also have built-in methods for querying or manipulating the document, but it's important to realize that once you've parsed the document it's usually pretty easy to treat it as though it were XML in the first place (using the standard Java XML interfaces). In other words, you only need these libraries to parse the malformed input; the other utilities they provide are mostly superfluous.
Here's an example that shows parsing HTML using HTMLCleaner and then converting that object into a standard org.w3c.dom.Document
:
TagNode tagNode = new HtmlCleaner().clean("<html><div><p>test");
DomSerializer ser = new DomSerializer(new CleanerProperties());
org.w3c.dom.Document doc = ser.createDOM(tagNode);
In Jsoup, simply parse the input and serialize it into a string:
String text = Jsoup.parse("<html><div><p>test").outerHtml();
And convert that string into a W3C Document using one of the methods described here:
You can now use the standard JAXP interfaces to transform this document:
TransformerFactory tFact = TransformerFactory.newInstance();
Transformer transformer = tFact.newTransformer();
Source source = new DOMSource(doc);
Result result = new StreamResult(System.out);
transformer.transform(source, result);
Note: Provide some XSLT source to tFact.newTransformer()
to do something more useful than the identity transform.