Extracting HTML fragments in Java

Question

I have text that may contain HTML islands.

Example:

qwwdeadaskdfdaskjfhbsdfkf<a href="/cookbook/modifying-data/set-attributes">Set attribute values</a>gfkjgfkjrgjgjgjgjgroggjrog <b>jsoup</b>sdflkjsdfsfklsfklfjsfkljsfljsf<a href="/apidocs/org/jsoup/Jsoup.html#parse(java.lang.String)" title="Parse HTML into a Document.">Jsoup.parse(String html)</a>skgjdfgkjdfgkldfjgdfkgljdfg

How can I extract those HTML fragments?

What defines the boundaries between HTML text and not-HTML text? — Ira Baxter, Mar 05 '12 at 16:59
Whatever you do, [don't consider regular expressions](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). — , Mar 05 '12 at 17:24

score 0 · Answer 1 · answered Mar 05 '12 at 17:15

Java supports both DOM and SAX parsing for XML, but both require the document to be well-formed, so your example would not parse as it stands. There is a project called NekoHTML (http://nekohtml.sourceforge.net/) that supports scanning HTML that is not well-formed.
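To illustrate the well-formedness point, here is a minimal sketch using only the JDK's built-in parser. The class and method names (`WellFormedCheck`, `parsesAsXml`) are made up for this example, and the sample strings are shortened versions of the text in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of any library.
public class WellFormedCheck {

    // Returns true only if the whole text is a single well-formed XML document.
    public static boolean parsesAsXml(String text) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            // The parser reports malformed input as a SAXException.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Loose text around the element: not a well-formed document.
        System.out.println(parsesAsXml("qwwde<b>jsoup</b>sdflk")); // prints false
        // A single root element on its own parses fine.
        System.out.println(parsesAsXml("<b>jsoup</b>"));           // prints true
    }
}
```

The default error handler may also print a "[Fatal Error]" line to stderr for the rejected input.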

score 0 · Answer 2 · answered Mar 05 '12 at 17:24

I do exactly what you are asking (finding HTML fragments in a chunk of text) by wrapping an enclosing tag around the text and then using a javax.xml.parsers.DocumentBuilder to create a DOM tree.

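The basic idea (omitting much) is:

// parse(String) treats its argument as a URI, so wrap the text in an org.xml.sax.InputSource over a java.io.StringReader
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
String fragment = "<wrap_node>" + orig_text + "</wrap_node>";
Document d = builder.parse(new InputSource(new StringReader(fragment)));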

If the tags aren't well-formed (missing end tag, improper nesting, etc.), parsing fails with an exception. That suits me because I want to reject anything malformed.
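A fuller sketch of the wrapping approach might look like the following. It is only one way to fill in the omitted parts: the class name `FragmentExtractor`, the method `extract`, and the choice to serialize each top-level element back to a string with a `Transformer` are assumptions for illustration, not the answerer's actual code.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
public class FragmentExtractor {

    // Returns the top-level elements embedded in the text, serialized back to markup.
    // Throws org.xml.sax.SAXException if the wrapped text is not well-formed XML.
    public static List<String> extract(String origText) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        String wrapped = "<wrap_node>" + origText + "</wrap_node>";
        Document doc = builder.parse(new InputSource(new StringReader(wrapped)));

        // Identity transformer used only to turn DOM nodes back into text.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        List<String> fragments = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Text nodes are the non-HTML filler; element nodes are the fragments.
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                StringWriter out = new StringWriter();
                serializer.transform(new DOMSource(child), new StreamResult(out));
                fragments.add(out.toString());
            }
        }
        return fragments;
    }

    public static void main(String[] args) throws Exception {
        String text = "qwwde<a href=\"/cookbook/modifying-data/set-attributes\">"
                + "Set attribute values</a>gfkj <b>jsoup</b>sdflk";
        for (String fragment : extract(text)) {
            System.out.println(fragment);
        }
    }
}
```

Because this uses an XML parser, a bare `&` or `<` in the surrounding text, or an HTML entity such as `&nbsp;`, also counts as malformed and causes the whole input to be rejected.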
