
When I parse an HTML file using jsoup, text that spans multiple lines in the HTML file (separated by <br />) is returned as a single line without newlines (\n). How can I parse a multi-line HTML document into multi-line strings?

I am using the method: Element.text()

Example:

The HTML contains C code that displays correctly on multiple lines in the HTML file, but when I extract the text, everything ends up on a single line without newline characters.

madth3

2 Answers


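Replace each <br /> with a placeholder, extract the text, and then turn the placeholder back into a newline, like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.connect("http://www.ejemplo.html").get(); // the page still contains the <br> tags
// Swap every <br> for "$$$"; depending on the jsoup version, html() may write the tag as <br /> or <br>
String temp = doc.html().replace("<br />", "$$$").replace("<br>", "$$$");
doc = Jsoup.parse(temp); // parse again

String text = doc.body().text().replace("$$$", "\n"); // the placeholders become new lines (\n)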
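
To see why the placeholder is needed, here is a minimal sketch of the same round trip using only the Java standard library. It is not jsoup: the regex-based tag stripping and whitespace collapsing are a crude stand-in for what text() does, and the class name, method name, and sample input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BrPlaceholderDemo {
    // Hypothetical marker; choose a string that cannot occur in the page text.
    private static final String MARKER = "$$$";
    // Matches <br>, <br/> and <br />, case-insensitively.
    private static final Pattern BR = Pattern.compile("(?i)<br\\s*/?>");
    // Crude tag matcher, good enough for a demo but not for real-world HTML.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String textWithLineBreaks(String html) {
        // 1. Swap every line-break tag for the marker ('$' must be quoted in a regex replacement).
        String marked = BR.matcher(html).replaceAll(Matcher.quoteReplacement(MARKER));
        // 2. Stand-in for text extraction: drop the remaining tags and collapse whitespace.
        //    This collapsing is the step that would swallow a plain "\n".
        String withoutTags = TAG.matcher(marked).replaceAll(" ");
        String flat = WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
        // 3. The marker survived the collapsing; turn it into a real newline.
        return flat.replace(MARKER, "\n");
    }

    public static void main(String[] args) {
        String html = "<pre>int main() {<br />  return 0;<br>}</pre>";
        System.out.println(textWithLineBreaks(html));
    }
}
```

Note that indentation after a line break is still collapsed to a single space, because the whitespace normalization runs before the marker is restored.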
Garrett Hyde
acrux

The text() method of Element (and TextNode) calls appendWhitespaceIfBr(...), which replaces every <br /> (or whitespace) with a blank. Unfortunately, I see no mechanism for turning this off without modifying the jsoup code.

But you could try replacing all <br /> tags with instances of a new subclass of Node.

ollo