Suppose I have a div as such:

<div>
This is a paragraph
written by someone
on the internet.
</div>

The problem is that when JSoup parses this, it appears to put everything on one line, so calling text() returns:

This is a paragraphwritten by someoneon the internet.

Now, I realize this isn't really a JSoup problem, since the actual HTML doesn't contain a space. However, is there any way to make JSoup (perhaps through an override or an option I haven't seen) add a space between lines as it parses? I imagine it must be possible (in Chrome I can inspect the element, turn off word wrap, and get what I want), but I'm not sure JSoup can do this.

Any thoughts?

AHungerArtist

2 Answers

Can you provide a full example of your code? What version of jsoup are you using?

In the current version (1.6.1), this code:

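import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Parse a div whose text is split across lines with \n
Document doc = Jsoup.parse("<div>\n" +
    "This is a paragraph\n" +
    "written by someone\n" +
    "on the internet.\n" +
    "</div>");
System.out.println(doc.text());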

Produces:

This is a paragraph written by someone on the internet.

That is, \n (and \r\n, etc.) are normalized to single spaces in the text output.

Happy to fix or improve it, if I can replicate :)

Jonathan Hedley
  • Actually, Jsoup seems to be perfectly fine, it's the way I was reading in the data (btw, JSoup is pretty awesome). If I'd just done a simple test like the above, I would have known to look elsewhere sooner. However, I am curious now, is it possible to have JSoup not parse the new lines? – AHungerArtist Aug 29 '11 at 09:22
  • Yes, the newlines are retained and only normalised on .text() output. You can get at them by accessing the TextNode for the text and calling .getWholeText() - http://jsoup.org/apidocs/org/jsoup/nodes/TextNode.html#getWholeText() – Jonathan Hedley Aug 29 '11 at 22:45
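A minimal sketch of the getWholeText() approach from the comment above, assuming a jsoup version that provides Element.textNodes() (newer than the 1.6.1 mentioned in the answer). The class name WholeTextDemo and the helper rawDivText are hypothetical names used only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

class WholeTextDemo {
    // Hypothetical helper: concatenates the un-normalized text of the
    // direct text nodes of the first <div> in the given HTML.
    static String rawDivText(String html) {
        Document doc = Jsoup.parse(html);
        Element div = doc.select("div").first();
        StringBuilder raw = new StringBuilder();
        for (TextNode node : div.textNodes()) {
            // getWholeText() keeps the original newlines and spacing
            raw.append(node.getWholeText());
        }
        return raw.toString();
    }

    public static void main(String[] args) {
        String html = "<div>\nThis is a paragraph\nwritten by someone\non the internet.\n</div>";
        // Normalized: newlines collapsed to single spaces
        System.out.println(Jsoup.parse(html).text());
        // Raw: newlines preserved as they appear in the source
        System.out.println(rawDivText(html));
    }
}
```

The helper only reads the div's direct text nodes; text inside nested elements would need a traversal of the child nodes as well.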
The following post shows how to get everything, including the line breaks:

Removing HTML entities while preserving line breaks with JSoup

The answer and its comment in the following post show another way (read the comment):

Remove HTML tags from a String

And this one has yet another way, if you check all the answers and comments:

How do I preserve line breaks when using jsoup to convert html to plain text?
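For readers who cannot follow the linked posts, here is one possible sketch of converting HTML to plain text while keeping line breaks. It is not taken from those posts. It walks the parsed tree with jsoup's NodeVisitor and emits a newline for each br tag and after each p element. The class name LineBreakText and the method toPlainText are hypothetical, and treating only br and p as line breaks is a simplifying assumption.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

class LineBreakText {
    // Hypothetical helper: plain text with "\n" for <br> and after each <p>.
    static String toPlainText(String html) {
        final StringBuilder out = new StringBuilder();
        Jsoup.parse(html).body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // text() normalizes whitespace inside the text node
                    out.append(((TextNode) node).text());
                } else if (node instanceof Element
                        && ((Element) node).tagName().equals("br")) {
                    out.append('\n');
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Close each paragraph with a newline
                if (node instanceof Element
                        && ((Element) node).tagName().equals("p")) {
                    out.append('\n');
                }
            }
        });
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<p>Hello<br>world</p><p>Bye</p>"));
    }
}
```

Other block-level tags (div, li, headings) could be handled the same way by extending the check in tail().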

Community