This is not a duplicate. There was a similar question, but none of its answers can deal with a real HTML file. One can save any HTML page, even this one, and try any of the solutions from those answers; none of them solves the problem completely.
The question is:
I have a saved .htm file on my desktop. I need to get plain text from it. However, I need to keep the line breaks so that the text does not end up on just one or a couple of lines.
I tried the following, as well as all the methods from here:
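import java.io.FileInputStream;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;
import org.jsoup.Jsoup;

FileInputStream in = new FileInputStream("C:\\...myfile.htm");
// Read the whole file; UTF-8 is an assumption about the file's encoding
String htmlText = IOUtils.toString(in, StandardCharsets.UTF_8);
for (String line : htmlText.split("\n")) {
    String stripped = Jsoup.parse(line).text();
    System.out.println(stripped);
}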
This preserves only the line breaks of the HTML source. However, the text is still messed up, because tags such as <br> and <p> get removed. How can I parse the file so that the text keeps all of its natural line breaks?
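To make the desired behavior concrete, here is a minimal, dependency-free sketch of what I mean by "natural line breaks". It is not Jsoup and not a real HTML parser: it uses plain regular expressions, and the class name HtmlLineBreaks, the method toText, and the list of tags treated as breaks are all my own assumptions for illustration. It collapses the whitespace of the source, turns <br> and the closing tags of a few block elements into newlines, and only then strips the remaining markup.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: regex-based, not a full HTML parser.
class HtmlLineBreaks {
    // Tags that should become a line break: <br>, <br/>, the malformed </br>,
    // and the closing tags of some common block-level elements (assumed list).
    private static final Pattern BREAK_TAGS = Pattern.compile(
            "(?i)<br\\s*/?>|</br\\s*>|</(p|div|h[1-6]|li|tr)\\s*>");
    // Script and style elements, whose content is not visible text.
    private static final Pattern SCRIPT_STYLE = Pattern.compile(
            "(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Any remaining tag.
    private static final Pattern ANY_TAG = Pattern.compile("(?s)<[^>]*>");

    static String toText(String html) {
        // Line breaks in the HTML source are not meaningful, so collapse them first.
        String s = html.replaceAll("\\s+", " ");
        s = SCRIPT_STYLE.matcher(s).replaceAll("");
        // Insert real newlines where the markup implies them, then drop all other tags.
        s = BREAK_TAGS.matcher(s).replaceAll("\n");
        s = ANY_TAG.matcher(s).replaceAll("");
        // Decode only a handful of common entities; &amp; goes last to avoid double decoding.
        s = s.replace("&nbsp;", " ").replace("&lt;", "<")
             .replace("&gt;", ">").replace("&amp;", "&");
        // Trim spaces around newlines and cap runs of blank lines.
        s = s.replaceAll(" *\n *", "\n").replaceAll("\n{3,}", "\n\n");
        return s.trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>First paragraph,\nwrapped in the source.</p>"
                + "<p>Second line<br>third line</p></body></html>";
        System.out.println(toText(html));
    }
}
```

For the sample input, the two paragraphs and the <br> each end up on their own line, while the line wrap inside the first paragraph of the source disappears. A sketch like this will likely break on comments containing ">", on <pre> blocks, and on most entities, which is why I am looking for a way to get the same result from a proper parser.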