I'm using Jsoup to remove all the images from an HTML page. I'm receiving the page in an HTTP response, which also specifies the content charset.
The problem is that Jsoup unescapes some special characters.
For example, for the input:
<html><head></head><body><p>isn&rsquo;t</p></body></html>
After running:
String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>";
Document doc = Jsoup.parse(check);
System.out.println(doc.outerHtml());
I get:
<html><head></head><body><p>isn’t</p></body></html>
I want to avoid changing the html in any other way except for removing the images.
By using the command:
doc.outputSettings().prettyPrint(false).charset("ASCII").escapeMode(EscapeMode.extended);
I do get the correct output, but I'm sure there are cases where that charset won't be suitable. I just want to use the charset specified in the HTTP header, and I'm afraid forcing ASCII will change my document in ways I can't predict. Is there a cleaner way to remove the images without inadvertently changing anything else?
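To make the charset concern concrete, here is a minimal sketch using only the Java standard library (no Jsoup calls). It assumes the header value looks like `text/html; charset=UTF-8`; the names `CharsetCheck`, `charsetFromContentType`, and `canRepresent` are hypothetical and made up for illustration. It reads the charset from the Content-Type value and checks whether that charset can encode ’ directly, which, as far as I understand, is the kind of check that decides whether a serializer has to fall back to an entity.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class CharsetCheck {

    // Hypothetical helper: extract the charset from a Content-Type header value
    // such as "text/html; charset=UTF-8". Returns the fallback when the header
    // is missing, has no charset parameter, or names an unknown charset.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // True when the charset can encode the character as-is, i.e. no entity
    // would be needed to represent it in output written with that charset.
    static boolean canRepresent(Charset cs, char c) {
        return cs.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char rsquo = '\u2019'; // the right single quotation mark from the example
        Charset fromHeader =
                charsetFromContentType("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1);

        System.out.println(fromHeader + " can encode it: " + canRepresent(fromHeader, rsquo));
        System.out.println("US-ASCII can encode it: "
                + canRepresent(StandardCharsets.US_ASCII, rsquo));
    }
}
```

If the header charset cannot encode a character, escaping it seems unavoidable for that output; if it can, handing that charset to the output settings instead of "ASCII" is the variant I'd expect to change the least, though I haven't verified that for every input.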
Thank you!