HTML processing in Java: Convert HTML to other formats

Question

OK, there are many HTML/XML parsers for Java. What I want to do is a bit more than just parsing: I want to filter the content and get it into a suitable form.

More precisely, I want to keep only the text and images. However, I want to preserve some of the text formatting, too, like: italic, bold, alignment, etc.

All this is because I'm trying to implement a converter that converts HTML to a specific format that I've created myself for my own purposes.

Any ideas? Surely, it must have been done many times before.

score 5 · Answer 1 · edited Jun 20 '20 at 09:12


If your intent is to clean user-submitted content against a safe white-list to prevent XSS, then I'd suggest using Jsoup for this. It provides a built-in white-list. It's then as simple as:

String safeHtml = Jsoup.clean(unsafeHtml, Whitelist.basicWithImages());

You can customize the Whitelist as described in its javadoc.


Damn, this JSoup is really well thought. +1 – Pascal Thivent Oct 02 '10 at 16:06
Thanks. The link turned out to be ***very*** useful! As I *said* I am trying to convert HTML to my custom format. Jsoup is quite promising, but HtmlUnit *is* quite close to the point! Thanks a lot! – Albus Dumbledore Oct 02 '10 at 20:49
You're welcome :) After cleaning you could use Jsoup as well to iterate over all HTML elements and convert/transform each into another markup. You can also do this with XSLT, though it may end up being pretty complex, since you have to specify every HTML element and/or attribute separately. – BalusC Oct 02 '10 at 21:07

score 2 · Answer 2 · answered Oct 02 '10 at 15:25


JTidy + XSLT?
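
A minimal sketch of the XSLT half of this idea, using only the JDK's javax.xml.transform API. It assumes the input has already been tidied into well-formed XHTML (the JTidy step is not shown), that the elements are not in a namespace, and that the target format uses bracket tags such as [b] and [img ...]. That target format, the class name XsltConverter and the stylesheet are all made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: converts tidy XHTML to an invented bracket markup.
final class XsltConverter {

    // Only the elements we care about get a template. Everything else falls
    // through to XSLT's built-in rules, which recurse into child elements
    // and copy text nodes to the output.
    private static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"b|strong\">[b]<xsl:apply-templates/>[/b]</xsl:template>"
        + "<xsl:template match=\"i|em\">[i]<xsl:apply-templates/>[/i]</xsl:template>"
        + "<xsl:template match=\"img\">[img <xsl:value-of select=\"@src\"/>]</xsl:template>"
        + "</xsl:stylesheet>";

    static String convert(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalStateException("XSLT conversion failed", e);
        }
    }
}
```

If the tidied document puts its elements in the XHTML namespace, the match patterns would need a namespace prefix bound to that namespace; the sketch above sidesteps this by assuming namespace-free input.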


Denis Tulskiy


score 1 · Answer 3 · answered Oct 02 '10 at 11:56


Have a look at HTML Parser, it could be handy.


George Profenza


score 0 · Accepted Answer · answered Oct 02 '10 at 11:53

OK, I think I've figured it out: when parsing the Element I can construct a javax.swing.text.html.InlineView, i.e. InlineView iv = new InlineView(element), and then get the attributes with iv.getAttributes().

Right. If you could help more, i.e. have some first-hand experience to share, please do!
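
For comparison, here is a rough sketch that stays in the same javax.swing.text.html package but uses a different technique than InlineView: an HTMLEditorKit.ParserCallback that receives tags and text as the JDK's built-in parser walks the input. The bracket-style output format and the class name SwingHtmlFilter are invented for illustration; the real target format would go in their place.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: keeps text, images and a little formatting.
final class SwingHtmlFilter {

    static String convert(String html) {
        StringBuilder out = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data); // plain text is always kept
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.B || tag == HTML.Tag.STRONG) {
                    out.append("[b]");
                } else if (tag == HTML.Tag.I || tag == HTML.Tag.EM) {
                    out.append("[i]");
                } else if (tag == HTML.Tag.P) {
                    // Preserve alignment when the paragraph declares it.
                    Object align = attrs.getAttribute(HTML.Attribute.ALIGN);
                    out.append(align == null ? "[p]" : "[p align=" + align + "]");
                }
                // Every other start tag is dropped; its text still arrives
                // through handleText.
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.B || tag == HTML.Tag.STRONG) {
                    out.append("[/b]");
                } else if (tag == HTML.Tag.I || tag == HTML.Tag.EM) {
                    out.append("[/i]");
                } else if (tag == HTML.Tag.P) {
                    out.append("[/p]\n");
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Elements without content, such as <img> and <br>, arrive here.
                if (tag == HTML.Tag.IMG) {
                    out.append("[img ").append(attrs.getAttribute(HTML.Attribute.SRC)).append("]");
                } else if (tag == HTML.Tag.BR) {
                    out.append('\n');
                }
            }
        };

        try {
            // The third argument tells the parser to ignore any charset declaration.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

The Swing parser is lenient with malformed markup, but it targets an old HTML dialect, so modern pages may not come through cleanly.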

score 0 · Answer 5 · answered Oct 02 '10 at 15:19

You can use the XML DOM parser in the org.w3c.dom and javax.xml.parsers packages. With it you can parse the document and read the node contents, provided the HTML is well-formed XML (XHTML):

 Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);

and then get the elements with

NodeList nl = doc.getElementsByTagName("p"); // for paragraph tags

Then read the content from the NodeList; it gives you the whole content of each paragraph tag. You can do the same for any other tag.
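
Put together, the steps above might look like the following sketch. It parses from a String rather than a file so that it is self-contained, and it assumes the input is well-formed XHTML without a DOCTYPE. The class name ParagraphExtractor is made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: collects the text of every <p> element.
final class ParagraphExtractor {

    static List<String> paragraphs(String xhtml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));

            NodeList nl = doc.getElementsByTagName("p"); // paragraph tags
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nl.getLength(); i++) {
                // getTextContent() concatenates the text of all descendants,
                // so nested tags such as <b> lose their formatting here.
                result.add(nl.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Ordinary, non-well-formed HTML is rejected by the XML parser.
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
    }
}
```

Because getTextContent() flattens nested markup, keeping bold or italic would mean walking each paragraph's child nodes instead of reading its text in one call.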
