
I'm using Jsoup to remove all the images from an HTML page. I'm receiving the page through an HTTP response - which also contains the content charset.

The problem is that Jsoup unescapes some special characters.

For example, for the input:

<html><head></head><body><p>isn&rsquo;t</p></body></html>

After running

String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>";
Document doc = Jsoup.parse(check);
System.out.println(doc.outerHtml());

I get:

<html><head></head><body><p>isn’t</p></body></html><p></p>

I want to avoid changing the html in any other way except for removing the images.

By using the command:

doc.outputSettings().prettyPrint(false).charset("ASCII").escapeMode(EscapeMode.extended);

I do get the correct output but I'm sure there are cases where that charset won't be good. I just want to use the charset specified in the HTTP header and I'm afraid this will change my document in ways I can't predict. Is there any other cleaner method for removing the images without changing anything else inadvertently?

Thank you!

Jacob van Lingen
dlvhdr

1 Answer


Here is a workaround not involving any charset except the one specified in the HTTP header.

String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>".replaceAll("&([^;]+?);", "**$1;");

Document doc = Jsoup.parse(check);

doc.outputSettings().prettyPrint(false).escapeMode(EscapeMode.extended);

System.out.println(doc.outerHtml().replaceAll("\\*\\*([^;]+?);", "&$1;"));

OUTPUT

<html><head></head><body><p>isn&rsquo;t</p></body></html>
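The "**" marker can collide with real content that contains asterisks (a comment below reports exactly that). Here is a minimal sketch of the same mask-and-restore idea with two changes: the marker is the ASCII 31 unit separator, and the pattern only matches things that look like entities. The class name EntityMask and its methods are hypothetical helpers, not part of Jsoup, and the sketch assumes the input never contains ASCII 31 and that the parser passes that character through untouched. The Jsoup step itself is left out so the example runs with the JDK alone.

```java
import java.util.regex.Pattern;

final class EntityMask {
    // Hypothetical sentinel: ASCII 31 (unit separator), assumed absent from the input.
    private static final String SENTINEL = "\u001F";

    // Named (&rsquo;), decimal (&#224;) and hexadecimal (&#xE0;) references.
    private static final String BODY = "(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);";
    private static final Pattern ENTITY = Pattern.compile("&" + BODY);
    private static final Pattern MASKED = Pattern.compile(SENTINEL + BODY);

    // Replace the leading '&' of every entity with the sentinel so the parser leaves it alone.
    static String mask(String html) {
        return ENTITY.matcher(html).replaceAll(SENTINEL + "$1;");
    }

    // Turn every masked entity back into its original form.
    static String unmask(String html) {
        return MASKED.matcher(html).replaceAll("&$1;");
    }

    public static void main(String[] args) {
        String html = "<p>isn&rsquo;t ** 5 &#224; done;</p>";
        String masked = mask(html);
        // A real pipeline would parse `masked` with Jsoup, remove the images,
        // and serialize the document here; this sketch skips that step.
        String restored = unmask(masked);
        System.out.println(restored.equals(html)); // prints true
    }
}
```

Because numeric references are masked as well, they should come back in their original form instead of being rewritten as named entities.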

DISCUSSION

I wish there was a solution in Jsoup's API - @dlv

Using Jsoup's API would require you to write a custom NodeVisitor. It would lead to (re)inventing some code that already exists inside Jsoup. The custom NodeVisitor would emit an HTML escape sequence instead of a Unicode character.

Another option would involve writing a custom character encoder. The default UTF-8 character encoder can encode the ’ character directly, which is why Jsoup doesn't preserve the original escape sequence (&rsquo;) in the final HTML code.
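To illustrate the encoder point, here is a small JDK-only sketch. It assumes, as the paragraph above implies, that the choice between writing a character raw and escaping it depends on whether the output charset's encoder can represent it. The class name EncoderCheck is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncoderCheck {
    // Ask a fresh encoder for the charset whether it can represent the character.
    static boolean canEncode(Charset charset, char c) {
        return charset.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char rsquo = '\u2019'; // right single quotation mark, i.e. &rsquo;
        System.out.println("UTF-8:    " + canEncode(StandardCharsets.UTF_8, rsquo));    // true
        System.out.println("US-ASCII: " + canEncode(StandardCharsets.US_ASCII, rsquo)); // false
    }
}
```

This would also explain why setting the charset to ASCII in the question produced &rsquo; again: ASCII cannot represent the character, so an escape has to be written.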

Either of the two options above represents a significant coding effort. Ultimately, an enhancement could be added to Jsoup to let us choose how characters are generated in the final HTML code: hexadecimal escape (&#xAB;), decimal escape (&#151;), the original escape sequence (&rsquo;), or the encoded character itself (which is the case in your post).
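For reference, the four output forms listed above look like this for the character from the question (U+2019). This is only a JDK sketch of the formats, not a Jsoup feature; the class EscapeForms and its one-entry name table are hypothetical.

```java
import java.util.Locale;
import java.util.Map;

final class EscapeForms {
    // Hypothetical, deliberately tiny table of named entities.
    private static final Map<Integer, String> NAMES = Map.of(0x2019, "rsquo");

    static String hex(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint).toUpperCase(Locale.ROOT) + ";";
    }

    static String decimal(int codePoint) {
        return "&#" + codePoint + ";";
    }

    // Falls back to the decimal form when the table has no name for the code point.
    static String named(int codePoint) {
        String name = NAMES.get(codePoint);
        return name != null ? "&" + name + ";" : decimal(codePoint);
    }

    static String raw(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int cp = 0x2019;
        System.out.println(hex(cp));     // &#x2019;
        System.out.println(decimal(cp)); // &#8217;
        System.out.println(named(cp));   // &rsquo;
        System.out.println(raw(cp));     // the character itself
    }
}
```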

Stephan
  • Thank you, I will use this for now although I wish there was a solution in Jsoup's API. – dlvhdr Dec 29 '15 at 08:43
    @Ravisha You can find this information in the "What's new" section on this page : https://jsoup.org/download – Stephan Jul 27 '20 at 07:14
    I ran into an issue where a client had multiple asterisks in their content and this logic was prefixing unwanted ampersands to the content. Instead of the asterisk character (*), I used an invisible ASCII 31 (unit separator). – James Moberg Sep 16 '20 at 19:26
  • My text contains numerical entities, like &#224;, and this method replaces them with named HTML entities, like &agrave;. Is there a way to avoid that? – Alexis Dufrenoy Aug 04 '21 at 14:10