I need to transform a HTML file, by removing certain tags from the file. To do this I have something like this -
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;
import org.jsoup.nodes.Entities.EscapeMode;
import java.io.IOException;
import java.io.File;
import java.util.*;
public class TestJsoup {
public static void main(String[] args) throws IOException {
Validate.isTrue(args.length == 1, "usage: supply url to fetch");
String url = args[0];
Document doc = null;
if(url.contains("http")) {
doc = Jsoup.connect(url).get();
} else {
File f = new File(url);
doc = Jsoup.parse(f, null);
}
/* remove some tags */
doc.outputSettings().escapeMode(Entities.EscapeMode.extended);
System.out.println(doc.html());
return;
}
}
The issue with the above code is that, when I use extended escape mode, the output has the html tag attributes being html encoded. Is there anyway to avoid this? Using escape mode as base or xhtml doesn't work as some of the non standard extended (like ’
) encoding give problems. For ex for the HTML below,
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>Test®</title>
</head>
<body style="background-color:#EDEDED;">
<P>
<font style="color:#003698; font-weight:bold;">Testing HTML encoding - ’ © with a <a href="http://www.google.com">link</a>
</font>
<br />
</P>
</body>
</html>
The output I get is,
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>

<title>Test®</title>

</head>

<body style="background-color:#EDEDED;">

<p>
 <font style="color:#003698; font-weight:bold;">Testing HTML encoding - ’ © with a <a href="http://www.g
oogle.com">link</a></font> <br />
</p>




</body>
</html>
Is there anyway to get around this issue?