Open source java library for HTML to text conversion

Question

Can you recommend an open source Java library (preferably ASL/BSD/LGPL license) that converts HTML to plain text - cleans all the tags, converts entities (&amp;, &nbsp;, etc.) and handles <br> and tables properly.

More Info

I have the HTML as a string, so there's no need to fetch it from the web. What I'm looking for is a method like this:

String convertHtmlToPlainText(String html)

Also [jsoup](http://jsoup.org/) is mentioned [here](http://stackoverflow.com/questions/9631477/retrieve-text-from-html-file-in-java), which is distributed under the liberal [MIT license](http://jsoup.org/license). — cubanacan, Oct 09 '13 at 15:32
At least according to the documentation, it does not do what I've asked (convert the page to plain text, NOT HTML manipulation) – David Rabinowitz, Oct 10 '13 at 07:00
@cubanacan Thanks, good to know there is another alternative — David Rabinowitz, Oct 14 '13 at 07:07
+1 for Jsoup! And if you are looking for some "light" formatting of the output text (e.g. line breaks around certain tags and similar), there is an example in the Jsoup repository, which is a great starting point: [HtmlToPlainText.java](https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/examples/HtmlToPlainText.java) – Till F., May 15 '18 at 13:52

score 21 · Accepted Answer · edited Sep 03 '13 at 08:47

Try Jericho.

The TextExtractor class sounds like it will do what you want. Sorry can't post a 2nd link as I'm a new user but scroll down the homepage a bit and there's a link to it.

edited Sep 03 '13 at 08:47

рüффп

answered Oct 05 '09 at 12:14

Chris R

Here's the link to that class: http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html – Chris R Oct 05 '09 at 12:15
Thanks! I actually used the Renderer at the end – David Rabinowitz Oct 05 '09 at 13:40
For the lazy: `String plainText = new Source(html).getRenderer().toString();` – Mike Gleason jr Couturier Jan 03 '18 at 15:24

score 3 · Answer 2 · edited Jan 26 '16 at 14:44

Try HtmlUnit; it can even give you the page as it looks after JavaScript / Ajax has been processed.

edited Jan 26 '16 at 14:44

Sean Patrick Floyd

answered Oct 05 '09 at 07:37

Ahmed Ashour

I see how it gives me the response as HTML, not text – David Rabinowitz Oct 05 '09 at 08:07
Check .asText() [http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/html/DomNode.html#asText()] – Ahmed Ashour Oct 05 '09 at 08:16
Thanks. I went for Jericho at the end, but I'll keep HtmlUnit in mind – David Rabinowitz Oct 05 '09 at 19:13

Pkunk · Answer 3 · 2016-04-03T10:55:54.787

The bliki engine can do this, in two steps. See info.bliki.wiki / Home

How to convert HTML to Mediawiki text -- Mediawiki text is already a rather plain text format, but you can convert it further
How to convert Mediawiki text to plain text -- your goal.

It will be some 7-8 lines of code, like this:

// html to wiki
import info.bliki.html.HTML2WikiConverter;
import info.bliki.html.wikipedia.ToWikipedia;
// wiki to plain text
import info.bliki.wiki.filter.PlainTextConverter;
import info.bliki.wiki.model.WikiModel;
...
String sbodyhtml = readFile( infilepath ); //get content as string
HTML2WikiConverter conv = new HTML2WikiConverter();
conv.setInputHTML( sbodyhtml );
String resultwiki = conv.toWiki(new ToWikipedia());
WikiModel wikiModel = new WikiModel("${image}", "${title}");
String plainStr = wikiModel.render(new PlainTextConverter(false), resultwiki );
System.out.println( plainStr );

Jsoup can do this more simply:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
...
Document doc = Jsoup.parse(sbodyhtml);
String plainStr = doc.body().text();

but the result loses all paragraph formatting -- there will be no newlines at all.

score 0 · Answer 4 · answered Oct 05 '09 at 07:57

I use TagSoup. It is available for several languages and does a really good job with HTML found "in the wild". It produces either a cleaned-up version of the HTML or XML, which you can then process with a DOM or SAX parser.

answered Oct 05 '09 at 07:57

Rich Seller

Thanks, but I need the final result in plain text – David Rabinowitz Oct 05 '09 at 08:08
Once it is in XML, you can implement a SAX handler that outputs only the text nodes (e.g. a DefaultHandler with no-op implementations of all methods apart from `characters`) – Rich Seller Oct 05 '09 at 08:38
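
A minimal sketch of the SAX approach from the comment above, using only the JDK. It assumes the input is already well-formed XML (for instance the cleaned-up output of a tool like TagSoup) and carries no DOCTYPE; the names `TextOnlySax`, `TextCollector` and `extractText` are made up for illustration and do not come from any library.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextOnlySax {

    /** Collects character data; every other SAX callback keeps DefaultHandler's no-op behaviour. */
    static final class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per text node, so always append.
            text.append(ch, start, length);
        }

        String text() {
            return text.toString();
        }
    }

    /** Assumption: xml is well-formed and has no DOCTYPE (nothing is fetched or validated). */
    public static String extractText(String xml) throws Exception {
        TextCollector collector = new TextCollector();
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xml)), collector);
        return collector.text();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Hello <b>world</b> &amp; friends</p></body></html>";
        System.out.println(extractText(xhtml)); // prints "Hello world & friends"
    }
}
```

As written, this only concatenates text nodes: block elements such as `<p>` or `<br/>` add no line breaks, and HTML-only entities like `&nbsp;` are not defined in plain XML, so they would have to be resolved before parsing.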

score -1 · Answer 5 · answered Feb 26 '13 at 18:41

I've used Apache Commons Lang to go the other way. But it looks like it can do what you need via StringEscapeUtils.

answered Feb 26 '13 at 18:41

firefly2442

I can't find any htmlToText() method - there are only methods for escaping HTML, so that "<b>hello</b>" will be converted to "&lt;b&gt;hello&lt;/b&gt;" instead of to "hello" – David Rabinowitz Feb 27 '13 at 07:10
Ahh, yes, I didn't see you wanted plain text. This is true. – firefly2442 Feb 27 '13 at 19:24
