The bliki engine can do this, in two steps. See info.bliki.wiki / Home
- How to convert HTML to Mediawiki text -- Mediawiki text is already a rather plain text format, but you can convert it further
- How to convert Mediawiki text to plain text -- your goal.
It will be some 7-8 lines of code, like this:
```java
// HTML to wiki text
import info.bliki.html.HTML2WikiConverter;
import info.bliki.html.wikipedia.ToWikipedia;
// wiki text to plain text
import info.bliki.wiki.filter.PlainTextConverter;
import info.bliki.wiki.model.WikiModel;
...
String sbodyhtml = readFile( infilepath ); // get the HTML content as a string
// step 1: HTML -> wiki text
HTML2WikiConverter conv = new HTML2WikiConverter();
conv.setInputHTML( sbodyhtml );
String resultwiki = conv.toWiki(new ToWikipedia());
// step 2: wiki text -> plain text
WikiModel wikiModel = new WikiModel("${image}", "${title}");
String plainStr = wikiModel.render(new PlainTextConverter(false), resultwiki );
System.out.println( plainStr );
```
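The snippet above calls `readFile`, which is not defined in this answer. A minimal sketch of such a helper, using only the standard library, might look like the following. The class name `HtmlFileReader` and the UTF-8 encoding are assumptions for illustration; adjust the charset to match your input files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class HtmlFileReader {

    // Hypothetical stand-in for the readFile helper used above:
    // reads the whole file into one String, assuming UTF-8 encoding.
    static String readFile(String infilepath) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(infilepath));
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Write a small sample file so the demo does not depend on external input.
        Path tmp = Files.createTempFile("sample", ".html");
        Files.write(tmp, "<p>Hello</p>".getBytes(StandardCharsets.UTF_8));
        try {
            System.out.println(readFile(tmp.toString()));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Reading the file in one go keeps the example short; for very large inputs a streaming approach would be a better fit.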
Jsoup can do this more simply:
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
...
Document doc = Jsoup.parse(sbodyhtml);
String plainStr = doc.body().text();
```
but the result loses all paragraph formatting -- there will be no newlines at all.
To keep some of the formatting (line-break tags and similar), there is an example in the Jsoup repository, which is a great starting point: [HtmlToPlainText.java](https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/examples/HtmlToPlainText.java)
– Till F. May 15 '18 at 13:52