Jsoup unescapes special characters

Question

I'm using Jsoup to remove all the images from an HTML page. I'm receiving the page through an HTTP response - which also contains the content charset.

The problem is that Jsoup unescapes some special characters.

For example, for the input:

<html><head></head><body><p>isn&rsquo;t</p></body></html>

After running

String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>";
Document doc = Jsoup.parse(check);
System.out.println(doc.outerHtml());

I get:

<html><head></head><body><p>isn’t</p></body></html>

I want to avoid changing the html in any other way except for removing the images.

By using the command:

doc.outputSettings().prettyPrint(false).charset("ASCII").escapeMode(EscapeMode.extended);

I do get the correct output but I'm sure there are cases where that charset won't be good. I just want to use the charset specified in the HTTP header and I'm afraid this will change my document in ways I can't predict. Is there any other cleaner method for removing the images without changing anything else inadvertently?

Thank you!

Stephan · Accepted Answer · 2016-01-21T15:35:58.100

Here is a workaround not involving any charset except the one specified in the HTTP header.

String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>".replaceAll("&([^;]+?);", "**$1;");

Document doc = Jsoup.parse(check);

doc.outputSettings().prettyPrint(false).escapeMode(EscapeMode.extended);

System.out.println(doc.outerHtml().replaceAll("\\*\\*([^;]+?);", "&$1;"));

OUTPUT

<html><head></head><body><p>isn&rsquo;t</p></body></html>
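
One weakness of the ** marker is that text which already contains ** followed later by a semicolon is rewritten with an ampersand on the way back. Below is a minimal sketch of the same masking idea using ASCII 31 (unit separator) as the marker instead. The class name EntityMask, the method names, and the tighter entity pattern are assumptions made for illustration, not part of Jsoup, and the sketch assumes the input never contains ASCII 31 and that the parser passes that character through unchanged.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Jsoup): hides entities before parsing, restores them after.
final class EntityMask {
    // ASCII 31 (unit separator); assumed never to appear in the input HTML.
    private static final String SENTINEL = String.valueOf((char) 31);

    // Named (&rsquo;) and numeric (&#8217;, &#x2019;) entities. This is narrower than the
    // "&([^;]+?);" pattern in the answer, which is an assumption of this sketch.
    private static final Pattern ENTITY = Pattern.compile("&(#?[A-Za-z0-9]+);");
    private static final Pattern MASKED = Pattern.compile("\\x1F(#?[A-Za-z0-9]+);");

    // Replace the leading '&' of each entity with the sentinel.
    static String mask(String html) {
        return ENTITY.matcher(html).replaceAll(SENTINEL + "$1;");
    }

    // Turn each sentinel-prefixed entity back into its original form.
    static String unmask(String html) {
        return MASKED.matcher(html).replaceAll("&$1;");
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body><p>isn&rsquo;t</p></body></html>";
        // In real use, the Jsoup parse and image removal would sit between the two calls.
        System.out.println(unmask(mask(html)));
    }
}
```

In the workaround above, mask(...) would take the place of the first replaceAll call and unmask(...) the second, i.e. Jsoup.parse(EntityMask.mask(check)) followed by EntityMask.unmask(doc.outerHtml()).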

DISCUSSION

I wish there was a solution in Jsoup's API - @dlv

Using Jsoup's API would require you to write a custom NodeVisitor. It would lead to (re)inventing some code that already exists inside Jsoup. The custom NodeVisitor would emit an HTML escape code instead of a Unicode character.

Another option would involve writing a custom character encoder. The default UTF-8 character encoder can encode ’. This is why Jsoup doesn't preserve the original escape sequence in the final HTML code.

Either of the two options above represents a big coding effort. Ultimately, an enhancement could be added to Jsoup to let us choose how characters are generated in the final HTML code: hexadecimal escape (&#x2019;), decimal escape (&#8217;), the original escape sequence (&rsquo;), or the encoded character itself (which is the case in your post).
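
To make the numeric escape forms concrete, here is a small sketch in plain Java that produces the hexadecimal and decimal escapes for a code point and applies the decimal form to every non-ASCII character in a string. NumericEscapes and its methods are hypothetical names for illustration only; nothing here is Jsoup API, and it does not recover the original named entity.

```java
// Hypothetical helper (not part of Jsoup): builds numeric HTML escapes by hand.
final class NumericEscapes {
    // Hexadecimal character reference, e.g. U+2019 -> &#x2019;
    static String hex(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint).toUpperCase() + ";";
    }

    // Decimal character reference, e.g. U+2019 -> &#8217;
    static String decimal(int codePoint) {
        return "&#" + codePoint + ";";
    }

    // Keep ASCII as-is and write every other code point as a decimal escape.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.appendCodePoint(cp);
            } else {
                out.append(decimal(cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(0x2019));
        System.out.println(decimal(0x2019));
        System.out.println(escapeNonAscii("isn\u2019t"));
    }
}
```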

Thank you, I will use this for now although I wish there was a solution in Jsoup's API. — dlvhdr, Dec 29 '15 at 08:43
@Ravisha You can find this information in the "What's new" section on this page : https://jsoup.org/download — Stephan, Jul 27 '20 at 07:14
I ran into an issue where a client had multiple asterisks in their content and this logic was prefixing unwanted ampersands to the content. Instead of the asterisk character (*), I used an invisible ASCII 31 (unit separator). — James Moberg, Sep 16 '20 at 19:26
My text contains numeric entities, like &#224;, and this method replaces them with HTML entities, like &agrave;. Is there a way to avoid that? — Alexis Dufrenoy, Aug 04 '21 at 14:10
