I am having a problem replacing line breaks inside all <pre> elements in a given HTML document using Jsoup. I am trying to replace every \n character with <br> in the inner HTML of the <pre> tags only, leaving the rest of the content as it is. Here is what I have tried so far:

String body = "<p>This is the output:</p>\n<pre class=\"lang-xml prettyprint prettyprinted\">\n<code><span class=\"dec\">&lt;!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\"&gt;</span><span class=\"pln\">\n</span><span class=\"tag\">&lt;HTML&gt;</span><span class=\"pln\">\n    </span><span class=\"tag\">&lt;HEAD&gt;</span><span class=\"pln\">\n        </span><span class=\"tag\">&lt;META</span><span class=\"pln\"> </span><span class=\"atn\">http-equiv</span><span class=\"pun\">=</span><span class=\"atv\">\"Content-Type\"</span><span class=\"pln\"> </span><span class=\"atn\">content</span><span class=\"pun\">=</span><span class=\"atv\">\"text/html; charset=iso-8859-1\"</span><span class=\"tag\">&gt;</span><span class=\"pln\">\n        </span><span class=\"tag\">&lt;TITLE&gt;</span><span class=\"pln\">GeteBayOfficialTime</span><span class=\"tag\">&lt;/TITLE&gt;</span><span class=\"pln\">\n    </span><span class=\"tag\">&lt;/HEAD&gt;</span><span class=\"pln\">\n    </span><span class=\"tag\">&lt;BODY&gt;</span><span class=\"pln\">\n\n* About to connect() to api.ebay.com port 443 (#0)\n*   Trying 66.135.211.100... * Timeout\n*   Trying 66.135.211.140... * Timeout\n*   Trying 66.211.179.150... * Timeout\n*   Trying 66.211.179.180... * Timeout\n*   Trying 66.135.211.101... * Timeout\n*   Trying 66.211.179.148... * Timeout\n* connect() timed out!\n* Closing connection #0\n</span><span class=\"tag\">&lt;P&gt;</span><span class=\"pln\">Error sending request</span></code></pre>";
log.info("printing before creating a Jsoup Doc " + body);
Document bodyDom = Jsoup.parse(body);
log.info("printing after creating a Jsoup Doc " + bodyDom.html());

Elements preTags = bodyDom.getElementsByTag("pre");

for (Element pre : preTags) {
    pre.html(pre.html().replaceAll("(\r\n|\n)", "<br />"));
    log.info("Pre element with linebreaks replaced -" + pre);
}

body = bodyDom.html();

Here is the log. It seems the HTML source loses its newline characters once I parse it into a Jsoup document:

**2013-12-10 10:14:59 INFO  FormattingTest:166** - printing before creating a Jsoup Doc <p>This is the output:</p>
<pre class="lang-xml prettyprint prettyprinted">
<code><span class="dec">&lt;!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"&gt;</span><span class="pln">
</span><span class="tag">&lt;HTML&gt;</span><span class="pln">
    </span><span class="tag">&lt;HEAD&gt;</span><span class="pln">
        </span><span class="tag">&lt;META</span><span class="pln"> </span><span class="atn">http-equiv</span><span class="pun">=</span><span class="atv">"Content-Type"</span><span class="pln"> </span><span class="atn">content</span><span class="pun">=</span><span class="atv">"text/html; charset=iso-8859-1"</span><span class="tag">&gt;</span><span class="pln">
        </span><span class="tag">&lt;TITLE&gt;</span><span class="pln">GeteBayOfficialTime</span><span class="tag">&lt;/TITLE&gt;</span><span class="pln">
    </span><span class="tag">&lt;/HEAD&gt;</span><span class="pln">
    </span><span class="tag">&lt;BODY&gt;</span><span class="pln">

* About to connect() to api.ebay.com port 443 (#0)
*   Trying 66.135.211.100... * Timeout
*   Trying 66.135.211.140... * Timeout
*   Trying 66.211.179.150... * Timeout
*   Trying 66.211.179.180... * Timeout
*   Trying 66.135.211.101... * Timeout
*   Trying 66.211.179.148... * Timeout
* connect() timed out!
* Closing connection #0
</span><span class="tag">&lt;P&gt;</span><span class="pln">Error sending request</span></code></pre>


**2013-12-10 10:14:59 INFO  FormattingTest:168** - printing after creating a Jsoup Doc <html>
 <head></head>
 <body>
  <p>This is the output:</p> 
  <pre class="lang-xml prettyprint prettyprinted">
<code><span class="dec">&lt;!DOCTYPE HTML PUBLIC &quot;-//W3C//DTD HTML 4.01 Transitional//EN&quot; &quot;http://www.w3.org/TR/html4/loose.dtd&quot;&gt;</span><span class="pln"> </span><span class="tag">&lt;HTML&gt;</span><span class="pln"> </span><span class="tag">&lt;HEAD&gt;</span><span class="pln"> </span><span class="tag">&lt;META</span><span class="pln"> </span><span class="atn">http-equiv</span><span class="pun">=</span><span class="atv">&quot;Content-Type&quot;</span><span class="pln"> </span><span class="atn">content</span><span class="pun">=</span><span class="atv">&quot;text/html; charset=iso-8859-1&quot;</span><span class="tag">&gt;</span><span class="pln"> </span><span class="tag">&lt;TITLE&gt;</span><span class="pln">GeteBayOfficialTime</span><span class="tag">&lt;/TITLE&gt;</span><span class="pln"> </span><span class="tag">&lt;/HEAD&gt;</span><span class="pln"> </span><span class="tag">&lt;BODY&gt;</span><span class="pln"> * About to connect() to api.ebay.com port 443 (#0) * Trying 66.135.211.100... * Timeout * Trying 66.135.211.140... * Timeout * Trying 66.211.179.150... * Timeout * Trying 66.211.179.180... * Timeout * Trying 66.135.211.101... * Timeout * Trying 66.211.179.148... * Timeout * connect() timed out! * Closing connection #0 </span><span class="tag">&lt;P&gt;</span><span class="pln">Error sending request</span></code></pre>
 </body>
</html>
2013-12-10 10:14:59 INFO  FormattingTest:174 - Pre element with linebreaks replaced -  <pre class="lang-xml prettyprint prettyprinted"><code><span class="dec">&lt;!DOCTYPE HTML PUBLIC &quot;-//W3C//DTD HTML 4.01 Transitional//EN&quot; &quot;http://www.w3.org/TR/html4/loose.dtd&quot;&gt;</span><span class="pln"> </span><span class="tag">&lt;HTML&gt;</span><span class="pln"> </span><span class="tag">&lt;HEAD&gt;</span><span class="pln"> </span><span class="tag">&lt;META</span><span class="pln"> </span><span class="atn">http-equiv</span><span class="pun">=</span><span class="atv">&quot;Content-Type&quot;</span><span class="pln"> </span><span class="atn">content</span><span class="pun">=</span><span class="atv">&quot;text/html; charset=iso-8859-1&quot;</span><span class="tag">&gt;</span><span class="pln"> </span><span class="tag">&lt;TITLE&gt;</span><span class="pln">GeteBayOfficialTime</span><span class="tag">&lt;/TITLE&gt;</span><span class="pln"> </span><span class="tag">&lt;/HEAD&gt;</span><span class="pln"> </span><span class="tag">&lt;BODY&gt;</span><span class="pln"> * About to connect() to api.ebay.com port 443 (#0) * Trying 66.135.211.100... * Timeout * Trying 66.135.211.140... * Timeout * Trying 66.211.179.150... * Timeout * Trying 66.211.179.180... * Timeout * Trying 66.135.211.101... * Timeout * Trying 66.211.179.148... * Timeout * connect() timed out! * Closing connection #0 </span><span class="tag">&lt;P&gt;</span><span class="pln">Error sending request</span></code></pre>

I am not sure what is wrong. The same code works with another HTML source, "\nResponse :\n some text \n \ndsjkhskjdh sdjhasjkdas \n", which is converted properly to:


Response :
some text

dsjkhskjdh sdjhasjkdas

I am not sure why the first sample isn't converted the same way.


1 Answer

The problem is that when you do this:

    Jsoup.parse("\nText\nNex").html();

the line breaks in the text come back collapsed into spaces:

    Text Nex

Based on a related question, you can do this:

    Document bodyDom = Jsoup.parse(body.replaceAll("(\\r\\n|\\n)", "<br />"));

This replaces every line break before the document is parsed.

Replace only inside <pre> tags

To replace only the line breaks inside pre elements, use a regular expression to extract each element and replace the line breaks within it:

    Pattern preP = Pattern.compile("<pre.*?>.+?</pre>", Pattern.DOTALL
            | Pattern.CASE_INSENSITIVE);
    Matcher m = preP.matcher(body);
    while (m.find()) {
        String toReplace = m.group();
        String replaced = toReplace.replaceAll("(\r\n|\n)", "<br />");
        body = body.replace(toReplace, replaced);
    }

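The `.+?` is a reluctant (non-greedy) quantifier: it stops at the first `</pre>` it finds, so nested pre elements are not matched correctly, and unbalanced ones are not matched at all. You can try to work around this with a more elaborate regex, but regular expressions cannot parse HTML reliably in general; see this answer for a better explanation. I recommend using the next option.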
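As a rough sketch of the same regex idea done in a single pass, the loop above could be written with `Matcher.appendReplacement`, which rewrites each match in place instead of calling `body.replace` on the whole string. The class name `PreLineBreaks` and the method `replaceLineBreaksInPre` are hypothetical names for illustration, and the opening-tag pattern `<pre\b[^>]*>` is a swapped-in variant of the original `<pre.*?>` so that a tag such as `<preview>` is not picked up. It still assumes balanced, non-nested pre elements.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PreLineBreaks {

    // One <pre ...>...</pre> block, with or without attributes.
    // The reluctant ".*?" stops at the first closing tag, so nesting is NOT handled.
    private static final Pattern PRE_BLOCK = Pattern.compile(
            "<pre\\b[^>]*>.*?</pre>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Replaces \r\n and \n with <br /> inside each <pre> block only.
    public static String replaceLineBreaksInPre(String html) {
        Matcher m = PRE_BLOCK.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replaced = m.group().replaceAll("\\r\\n|\\n", "<br />");
            // quoteReplacement stops '$' and '\' in the block being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(replaced));
        }
        // Copy whatever follows the last <pre> block unchanged.
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String body = "<p>a</p>\n<pre class=\"x\">\nline1\nline2</pre>\n<p>b</p>";
        System.out.println(replaceLineBreaksInPre(body));
    }
}
```

The line breaks outside the pre element are left alone, which is the behaviour asked for in the question; text between matches is copied through by `appendReplacement` and `appendTail` without modification.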

You can see examples of the regex here.

Clean the string before parsing

From the second answer you can use:

    Document.OutputSettings outputSettings = new Document.OutputSettings()
            .prettyPrint(false);
    body = Jsoup.clean(body, "", Whitelist.relaxed(), outputSettings);

and then (as in your original code):

    pre.html(pre.html().replaceAll("(\r\n|\n)", "<br />"));

With prettyPrint disabled, the clean method keeps the line breaks in its output, and the parser then handles them correctly.

Cheers

  • Hi, thanks for your answer. However, this does not help: if you look at the code, I want to replace the line breaks with `<br />` only inside the pre elements. Replacing them before parsing with Jsoup defeats the purpose, as I will lose all the line breaks well before I get to parse the pre elements.
    – Eswar Rajesh Pinapala Dec 13 '13 at 02:34
  • Hi, I added more code to handle your case. It works in my local test; please tell me if I am wrong. Cheers – Arturo Volpe Dec 13 '13 at 10:54
  • I really appreciate your effort. Quick question: does this take care of pre elements with any attributes, or with no attributes at all? I will give this a try now. – Eswar Rajesh Pinapala Dec 13 '13 at 19:10
  • Thanks for your answer, the regex solution worked well. The only question I have now is: does this match pre elements with any attributes or with no attributes? Also, does it process pre elements that are nested? – Eswar Rajesh Pinapala Dec 13 '13 at 19:33
  • Hi Eswar, the `.*?` part of the regular expression inside the tag makes it match any number of attributes (including none). I am trying it now with nested `pre` elements and it is not working; I will update my answer with the correct version. The other big problem is unbalanced `pre` tags: the regex only matches balanced ones. – Arturo Volpe Dec 13 '13 at 21:45
  • Thanks for the answer. By unbalanced pre, do you mean a pre element that has a starting tag but no ending tag? If so, I don't need to worry about that in my use case. – Eswar Rajesh Pinapala Dec 13 '13 at 22:49
  • Yes. Regular expressions don't work well with HTML, see my updated answer! Cheers and good luck! – Arturo Volpe Dec 13 '13 at 23:32
  • Thanks for the answer and detailed description. Unfortunately only the first option worked for me; I will not be able to use the second option. – Eswar Rajesh Pinapala Dec 14 '13 at 01:30
  • Just realized I didn't award the bounty. Just did. Cheers :) – Eswar Rajesh Pinapala Dec 17 '13 at 03:43