I want to preserve HTML entities while using Jsoup. Here is a UTF-8 test string from a website:

String html = "<html><body>hello &#151; world</body></html>";

String parsed = Jsoup.parse(html).toString();

When I print the parsed output as UTF-8, the sequence &#151; appears to be transformed into a character with code point 151.

Is there a way to have Jsoup preserve the original entity when outputting as UTF-8? If I output with ASCII encoding:

Document.OutputSettings settings = new Document.OutputSettings();
settings.charset(Charset.forName("ascii"));
Jsoup.parse(html).outputSettings(settings).toString();

I'll get:

hello &#x97; world

which is what I'm looking for.

user3203425
  • I don't think there is a way to do this. But it should be possible to output as ASCII (which you are already doing) and use that, since ASCII is a subset of UTF-8. – Jonas Czech Jun 03 '15 at 12:58
  • Possible duplicate of [Jsoup unescapes special characters](http://stackoverflow.com/questions/34368908/jsoup-unescapes-special-characters) – Stephan Jan 21 '16 at 11:15

1 Answer

You have hit a missing feature of Jsoup (as of this writing, Jsoup 1.8.3).

I can see three options:

Option 1

Submit a feature request at https://github.com/jhy/jsoup. I'm not sure it will be added soon, though.

Option 2

Use the workaround provided in this SO answer: https://stackoverflow.com/a/34493022/363573

Option 3

Write a custom NodeVisitor that turns characters with a given code point value back into their equivalent HTML escape sequence.
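
The core of such a conversion can be sketched in plain Java, independent of Jsoup. This is only a rough illustration: the class name EntityEscaper and its escape method are hypothetical, not part of Jsoup. It rewrites every code point matched by a caller-supplied predicate as a hexadecimal numeric character reference. I would expect it to work best on the already serialized string, because text written back into a Jsoup text node is likely to have its ampersand escaped again on output.

```java
import java.util.function.IntPredicate;

// Hypothetical helper, not part of Jsoup: turns selected characters
// back into numeric character references such as &#x97;.
public class EntityEscaper {

    // Replaces every code point accepted by shouldEscape with a
    // hexadecimal numeric character reference; all others are copied as-is.
    public static String escape(String text, IntPredicate shouldEscape) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (shouldEscape.test(cp)) {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for serialized output that contains code point 151 (0x97).
        String serialized = "hello \u0097 world";
        // Escape everything outside the ASCII range.
        System.out.println(escape(serialized, cp -> cp > 0x7F));
    }
}
```

Narrowing the predicate (for example to cp >= 0x80 && cp <= 0x9F) would leave ordinary non-ASCII text untouched and escape only the C1 control range that 151 falls into.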

Stephan