2

OK, there are many HTML/XML parsers for Java. What I want to do is a bit more than just parsing a document: I want to filter the content and get it into a suitable form.

More precisely, I want to keep only the text and images. However, I also want to preserve some of the text formatting, such as italic, bold, and alignment.

The reason is that I'm trying to implement a converter from HTML to a custom format that I've created for my own purposes.

Any ideas? Surely, it must have been done many times before.

Charles
Albus Dumbledore

5 Answers

5

If your intent is to clean user-submitted content against a safe whitelist to prevent XSS, then I'd suggest using Jsoup for this. It provides a built-in whitelist. It's then as simple as:

String safeHtml = Jsoup.clean(unsafeHtml, Whitelist.basicWithImages());

You can customize the Whitelist as described in its javadoc. (In Jsoup 1.14.2 and later the class is named Safelist.)

See also:

Community
BalusC
  • Damn, this Jsoup is really well thought out. +1 – Pascal Thivent Oct 02 '10 at 16:06
  • Thanks. The link turned out to be ***very*** useful! As I *said* I am trying to convert HTML to my custom format. Jsoup is quite promising, but HtmlUnit *is* quite close to the point! Thanks a lot! – Albus Dumbledore Oct 02 '10 at 20:49
  • You're welcome :) After cleaning, you could use Jsoup as well to iterate over all HTML elements and convert/transform each into another markup. You can also do this with XSLT, though it may end up being pretty complex since you have to specify every HTML element and/or attribute separately. – BalusC Oct 02 '10 at 21:07
2

JTidy + XSLT?
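
To make the XSLT half of this suggestion concrete, here is a minimal sketch using only the JDK's javax.xml.transform API. It assumes the input is already well-formed XHTML without a namespace declaration (roughly what a tidy step would have to produce first; JTidy itself is not used here). The target markup (*bold*, _italic_, [img:src]) and the class name XsltHtmlConverter are made up for illustration and are not the asker's actual format.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltHtmlConverter {

    // Hypothetical stylesheet: <b>/<strong> become *...*, <i>/<em> become _..._,
    // <img> becomes [img:src]. Every other element falls through to XSLT's
    // built-in rules, which drop the tag and keep the text inside it.
    private static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='b|strong'>*<xsl:apply-templates/>*</xsl:template>"
        + "<xsl:template match='i|em'>_<xsl:apply-templates/>_</xsl:template>"
        + "<xsl:template match='img'>[img:<xsl:value-of select='@src'/>]</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to a well-formed, namespace-free XHTML string.
    public static String convert(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Malformed input or stylesheet; rethrown unchecked to keep the sketch short
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(convert(
            "<p>Some <b>bold</b> and <i>italic</i> text<img src=\"pic.png\"/></p>"));
    }
}
```

Each additional element or attribute you care about (alignment, for instance) needs its own template, which is where the stylesheet can grow quickly.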

Denis Tulskiy
1

Have a look at HTML Parser, it could be handy.

George Profenza
0

O.K. I think I found it out: when parsing the Element I can construct a javax.swing.text.html.InlineView, i.e. InlineView iv = new InlineView(element), and then get the attributes as iv.getAttributes().

Right. If you could help more, i.e. have some first-hand experience to share, please do!
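
For what it's worth, a related route that stays inside the JDK's Swing HTML classes is the parser callback API rather than InlineView. The sketch below is a different technique from the one described above: it uses HTMLEditorKit.ParserCallback with ParserDelegator to stream tags and text, and emits a made-up markup (*bold*, _italic_, [img:src]). The class name SimpleHtmlConverter and the output format are assumptions for illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SimpleHtmlConverter {

    // Converts HTML into a hypothetical markup: *bold*, _italic_, [img:src].
    // All other tags are dropped and their text is kept.
    public static String convert(String html) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data);
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                out.append(marker(tag));
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                out.append(marker(tag));
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Empty elements such as <img> and <br> are reported here
                if (tag == HTML.Tag.IMG) {
                    out.append("[img:")
                       .append(attrs.getAttribute(HTML.Attribute.SRC))
                       .append("]");
                }
            }
        };
        try {
            // Third argument: ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // The same marker opens and closes a formatted run in this made-up format
    private static String marker(HTML.Tag tag) {
        if (tag == HTML.Tag.B || tag == HTML.Tag.STRONG) return "*";
        if (tag == HTML.Tag.I || tag == HTML.Tag.EM) return "_";
        return "";
    }
}
```

Calling SimpleHtmlConverter.convert(html) returns the converted string. Alignment could probably be handled the same way by reading HTML.Attribute.ALIGN from the attribute set passed to handleStartTag, though that is not shown here.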

Albus Dumbledore
0

You can use the XML DOM parser from the org.w3c.dom and javax.xml.parsers packages. With it you can easily parse the document and get the node contents, provided the input is well-formed XML (e.g. XHTML):

Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);

and then get the elements by using

NodeList nl = doc.getElementsByTagName("p"); // for paragraph tags

and then get the content from the NodeList. It will give you the whole content of each paragraph tag, and you can do the same for any other tag.
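
Put together, the steps above might look like the following minimal sketch. It assumes the input is well-formed XHTML passed in as a string; the class name ParagraphExtractor and the helper paragraphs are hypothetical. Note that getTextContent() returns only the text, so inline formatting such as bold is lost unless you walk the child nodes yourself.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ParagraphExtractor {

    // Returns the text content of every <p> element in a well-formed XHTML string.
    public static List<String> paragraphs(String xhtml) {
        try {
            // DocumentBuilder is obtained from a factory; parse() is an instance method
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));

            NodeList nl = doc.getElementsByTagName("p");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nl.getLength(); i++) {
                // getTextContent() concatenates the text of all descendants, without tags
                result.add(nl.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Malformed markup ends up here; rethrown unchecked to keep the sketch short
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>Hello <b>bold</b> world</p><p>Second</p></body></html>";
        System.out.println(paragraphs(xhtml)); // prints [Hello bold world, Second]
    }
}
```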

karthi