22

Can you recommend an open source Java library (preferably ASL/BSD/LGPL license) that converts HTML to plain text: strips all the tags, converts entities (&amp;, &nbsp;, etc.) and handles <br> and tables properly?

More Info

I have the HTML as a string, so there's no need to fetch it from the web. Also, what I'm looking for is a method like this:

String convertHtmlToPlainText(String html)
David Rabinowitz
  • 2
    Also [jsoup](http://jsoup.org/) is mentioned [here](http://stackoverflow.com/questions/9631477/retrieve-text-from-html-file-in-java), which is distributed under the liberal [MIT license](http://jsoup.org/license). – cubanacan Oct 09 '13 at 15:32
  • By the way, jsoup supports HTML5 – cubanacan Oct 09 '13 at 15:44
  • At least according to the documentation, it does not do what I've asked (convert the page to plain text, NOT HTML manipulation) – David Rabinowitz Oct 10 '13 at 07:00
  • 5
    Here you are `Jsoup.parse(html).text()` – cubanacan Oct 10 '13 at 22:33
  • @cubanacan Thanks, good to know there is another alternative – David Rabinowitz Oct 14 '13 at 07:07
  • +1 for Jsoup! And if you are looking for some "light" formatting of the output text (e.g. line breaks around block-level tags and similar), there is an example in the Jsoup repository, which is a great starting point: [HtmlToPlainText.java](https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/examples/HtmlToPlainText.java) – Till F. May 15 '18 at 13:52

5 Answers

21

Try Jericho.

The TextExtractor class sounds like it will do what you want. Sorry can't post a 2nd link as I'm a new user but scroll down the homepage a bit and there's a link to it.

рüффп
Chris R
3

Try HtmlUnit. It even shows the page after processing JavaScript / Ajax.

Sean Patrick Floyd
Ahmed Ashour
2

The bliki engine can do this, in two steps. See info.bliki.wiki / Home

  1. How to convert HTML to Mediawiki text -- Mediawiki text is already a rather plain text format, but you can convert it further
  2. How to convert Mediawiki text to plain text -- your goal.

It will be some 7-8 lines of code, like this:

// html to wiki
import info.bliki.html.HTML2WikiConverter;
import info.bliki.html.wikipedia.ToWikipedia;
// wiki to plain text
import info.bliki.wiki.filter.PlainTextConverter;
import info.bliki.wiki.model.WikiModel;
...
String sbodyhtml = readFile(infilepath); // get content as string
HTML2WikiConverter conv = new HTML2WikiConverter();
conv.setInputHTML(sbodyhtml);
String resultwiki = conv.toWiki(new ToWikipedia());
WikiModel wikiModel = new WikiModel("${image}", "${title}");
String plainStr = wikiModel.render(new PlainTextConverter(false), resultwiki);
System.out.println(plainStr);

Jsoup can do this more simply:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
...
Document doc = Jsoup.parse(sbodyhtml);
String plainStr = doc.body().text();

but in the result you lose all paragraph formatting -- there will not be any newlines.
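
If the missing newlines matter, one way to see what "keeping paragraph formatting" involves is to walk the parse events yourself and emit a line break at the tags you care about. The sketch below is not Jsoup or bliki: it swaps in the HTML parser that ships with the JDK (javax.swing.text.html) purely so that it runs without extra dependencies. The class name HtmlToText and the choice of which tags produce a newline or a tab are my own assumptions, and the Swing parser only understands roughly HTML 3.2, so treat it as an illustration rather than a replacement for the libraries above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (name is illustrative): HTML string in, plain text out.
final class HtmlToText {
    private HtmlToText() {}

    static String convertHtmlToPlainText(String html) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Entities such as &amp; arrive here already decoded.
                out.append(data);
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // <br> is an empty element, so it is reported as a simple tag.
                if (tag == HTML.Tag.BR) {
                    out.append('\n');
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                // Assumption: these closing tags should end the current line.
                if (tag == HTML.Tag.P || tag == HTML.Tag.DIV || tag == HTML.Tag.LI
                        || tag == HTML.Tag.TR || tag == HTML.Tag.H1 || tag == HTML.Tag.H2
                        || tag == HTML.Tag.H3 || tag == HTML.Tag.H4 || tag == HTML.Tag.H5
                        || tag == HTML.Tag.H6) {
                    out.append('\n');
                } else if (tag == HTML.Tag.TD || tag == HTML.Tag.TH) {
                    // Assumption: separate table cells with a tab.
                    out.append('\t');
                }
            }
        };
        try {
            // 'true' tells the parser to ignore any charset declared in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Reading from a StringReader should not fail in practice.
            throw new UncheckedIOException(e);
        }
        return out.toString().trim();
    }
}
```

Calling HtmlToText.convertHtmlToPlainText(sbodyhtml) then gives text with a newline after each paragraph and at each <br>. Note that the text of a <title> element, if present, is reported as ordinary text too.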

Pkunk
0

I use TagSoup. It is available for several languages and does a really good job with HTML found "in the wild". It produces either cleaned-up HTML or XML, which you can then process with a DOM or SAX parser.

Rich Seller
-1

I've used Apache Commons Lang to go the other way. But it looks like it can do what you need via StringEscapeUtils.
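
One caveat worth illustrating: unescaping entities (what StringEscapeUtils.unescapeHtml4 does in Commons Lang 3) only covers the entity half of the question and leaves the tags in place. The sketch below shows that with a tiny hand-rolled decoder using only the JDK. It is not the Commons Lang implementation: the class name EntityDecoder and its five-entry table of named entities are assumptions made for illustration, and a real library knows far more entities.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, deliberately tiny entity decoder (illustration only).
final class EntityDecoder {
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", "\u00A0");
    }

    // Matches &name;, &#123; and &#x1F; style references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]*);");

    private EntityDecoder() {}

    static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement = m.group(); // default: leave unknown entities untouched
            if (body.charAt(0) == '#') {
                boolean hex = body.length() > 1 && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
                int codePoint = hex
                        ? Integer.parseInt(body.substring(2), 16)
                        : Integer.parseInt(body.substring(1));
                if (Character.isValidCodePoint(codePoint)) {
                    replacement = new String(Character.toChars(codePoint));
                }
            } else if (NAMED.containsKey(body)) {
                replacement = NAMED.get(body);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Passing "<b>Tom &amp; Jerry</b>" through EntityDecoder.unescape decodes the ampersand but the <b> tags survive, so entity unescaping would have to be combined with a tag-stripping step from one of the other answers.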

firefly2442