Problems using extended escape mode for jsoup output

Question

I need to transform an HTML file by removing certain tags from it. To do this I have something like this:

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;

import java.io.File;
import java.io.IOException;

public class TestJsoup {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url or file path to fetch");
        String url = args[0];

        // Fetch over the network for http(s) URLs, otherwise read a local file.
        Document doc;
        if (url.startsWith("http")) {
            doc = Jsoup.connect(url).get();
        } else {
            File f = new File(url);
            doc = Jsoup.parse(f, null); // null charset: let jsoup detect it
        }

        /* remove some tags */

        // This setting is what produces the problematic output shown below.
        doc.outputSettings().escapeMode(Entities.EscapeMode.extended);
        System.out.println(doc.html());
    }
}

The issue with the above code is that, when I use extended escape mode, the attribute values of the HTML tags come out HTML-encoded. Is there any way to avoid this? Using base or xhtml as the escape mode doesn't work, because some of the non-standard extended entities (like ’) cause problems. For example, for the HTML below,

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>Test&reg;</title>
</head>
<body style="background-color:#EDEDED;">
<P>
   <font style="color:#003698; font-weight:bold;">Testing HTML encoding - &rsquo; &copy; with a <a href="http://www.google.com">link</a>
   </font> 
   <br />
</P>
</body>
</html>

The output I get is,

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
 <head>&NewLine;
  <title>Test&reg;</title>&NewLine;
 </head>&NewLine;
 <body style="background-color&colon;&num;EDEDED&semi;">&NewLine;
  <p>&NewLine; <font style="color&colon;&num;003698&semi; font-weight&colon;bold&semi;">Testing HTML encoding - &rsquor; &copy; with a <a href="http&colon;&sol;&sol;www&period;google&period;com">link</a></font> <br />&NewLine;</p>&NewLine;&NewLine;&NewLine;&NewLine;
 </body>
</html>

Is there any way to get around this issue?

score 8 · Accepted Answer · answered Jul 16 '11 at 01:31

What output character set are you using? (It defaults to the character set of the input, which, if you are loading from URLs, will vary from site to site.)

You probably want to set it explicitly, either to UTF-8, or to ASCII or some other restricted charset if you are working with systems that cannot deal with UTF-8. If you set the escape mode to base (the default) and the charset to ASCII, then any character (like rsquo) that cannot be represented natively in the selected charset will be output as a numeric escape.

For example:

String check = "<p>&rsquo; <a href='../'>Check</a></p>";
Document doc = Jsoup.parse(check);
doc.outputSettings().escapeMode(Entities.EscapeMode.base); // default

doc.outputSettings().charset("UTF-8");
System.out.println("UTF-8: " + doc.body().html());

doc.outputSettings().charset("ASCII");
System.out.println("ASCII: " + doc.body().html());

Gives:

UTF-8: <p>’ <a href="../">Check</a></p>
ASCII: <p>&#8217; <a href="../">Check</a></p>
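
The rule behind that output can be sketched with the JDK alone. This is only an illustration of the idea, not jsoup's actual implementation: the class `NumericEscapeSketch` and the helper `escapeFor` are hypothetical names. It assumes that a character is written as-is when the target charset can encode it, and as a numeric character reference otherwise, with `&`, `<` and `>` always escaped.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class NumericEscapeSketch {
    // Hypothetical helper (not a jsoup API): escape text for a given output charset.
    static String escapeFor(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '&') {
                out.append("&amp;");          // always escaped, whatever the charset
            } else if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (encoder.canEncode(new String(Character.toChars(cp)))) {
                out.appendCodePoint(cp);      // representable natively: emit as-is
            } else {
                out.append("&#").append(cp).append(';'); // numeric escape
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Test \u2019 & \u00a9"; // right single quote, ampersand, copyright sign
        System.out.println("UTF-8: " + escapeFor(text, StandardCharsets.UTF_8));
        System.out.println("ASCII: " + escapeFor(text, StandardCharsets.US_ASCII));
    }
}
```

With ASCII as the target, the right single quote (U+2019) comes out as `&#8217;`, matching the jsoup output above; with UTF-8 it passes through unchanged.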

Hope this helps!

Any idea how I can prevent the `&` from being escaped? It seems to be escaped with any character set, and also when the escape mode has been set to `Entities.EscapeMode.xhtml`. — Randy, Jul 09 '17 at 11:41
`&` needs to always be escaped to produce valid HTML / XML, so there's not an option to disable that. — Jonathan Hedley, Jul 09 '17 at 16:08
