
I have text that may contain HTML islands.

Example:

qwwdeadaskdfdaskjfhbsdfkf<a href="/cookbook/modifying-data/set-attributes">Set attribute values</a>gfkjgfkjrgjgjgjgjgroggjrog <b>jsoup</b>sdflkjsdfsfklsfklfjsfkljsfljsf<a href="/apidocs/org/jsoup/Jsoup.html#parse(java.lang.String)" title="Parse HTML into a Document.">Jsoup.parse(String html)</a>skgjdfgkjdfgkldfjgdfkgljdfg

How can I extract those HTML fragments?

gen_Eric
balderman

2 Answers


Java supports both DOM and SAX parsing for XML, but both require the document to be well-formed, so your example would not parse as it stands. There is a project called NekoHTML (http://nekohtml.sourceforge.net/) that supports scanning HTML that is not well-formed.

LINEMAN78

I do exactly what you are asking, finding HTML fragments in a chunk of text, by wrapping an enclosing tag around the text and then using a javax.xml.parsers.DocumentBuilder to create a DOM tree.

The basic idea (and omitting much) is just

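// Needs java.io.StringReader and org.xml.sax.InputSource; parse(String) would treat the text as a URI.
String fragment = "<wrap_node>" + orig_text + "</wrap_node>";
Document d = builder.parse(new InputSource(new StringReader(fragment)));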

If the tags aren't well-formed (missing end tags, improper nesting, etc.), this won't work, but that suits me because I want to reject anything malformed.
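
A fuller sketch of the wrap-and-parse idea might look like the following. It is only an illustration under some assumptions: the class name FragmentExtractor and the method extractFragments are made up for this example, only top-level elements are collected, and each one is serialized back to markup with the JDK's identity Transformer. The text is read through an InputSource, because DocumentBuilder.parse(String) interprets its argument as a URI rather than as content.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, named here for illustration only.
public class FragmentExtractor {

    /** Returns each top-level element found in origText, serialized back to markup. */
    public static List<String> extractFragments(String origText) throws Exception {
        // Wrap the text so the parser sees a single root element.
        String wrapped = "<wrap_node>" + origText + "</wrap_node>";

        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Throws org.xml.sax.SAXException if the wrapped text is not well-formed.
        Document doc = builder.parse(new InputSource(new StringReader(wrapped)));

        // Identity transformer, used only to turn DOM nodes back into strings.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        List<String> fragments = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Skip the plain text between the islands; keep only elements.
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                StringWriter out = new StringWriter();
                serializer.transform(new DOMSource(child), new StreamResult(out));
                fragments.add(out.toString());
            }
        }
        return fragments;
    }

    public static void main(String[] args) throws Exception {
        String text = "qwwdead<b>jsoup</b>sdflk<a href=\"/cookbook\">Set attribute values</a>gfkj";
        for (String fragment : extractFragments(text)) {
            System.out.println(fragment);
        }
    }
}
```

Because this is an XML parser, a stray & or < in the surrounding text is rejected just like an unclosed tag, which fits the goal of refusing anything malformed.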

Stephen P