I'm using Jsoup's parseBodyFragment() and parse() methods to work with blocks of code made up of script, noscript, and style tags. The goal isn't to clean them - just to select(), analyze, and output them. The select() portion works really well.

However, Jsoup automatically escapes the ampersands in the URL query parameters of src attributes. So, when the input is this:

<noscript>
<img height="1" width="1" style="display:none;" alt="" src="https://something.orother.com/i/cnt?txn_id=123&p_id=123"/>
</noscript>

I end up with this, returned from Jsoup via the outerHtml() method:

<noscript>
<img height="1" width="1" style="display:none;" alt="" src="https://something.orother.com/i/cnt?txn_id=123&amp;p_id=123"/>
</noscript>

The problem is that the plain ampersand (&) in the URL query string is escaped and output as &amp;. Is there a way to disable this?

I'm looking for a way to get the html of the selected element without modification. Thanks!

Update (2/23/2016): Clarified the problem. I also found an issue on the GitHub repo describing it: https://github.com/jhy/jsoup/issues/372. It looks like this might not be possible.

  • You can get the page document using parse and later get the content using select. – Spartan Feb 23 '17 at 05:54
  • @thanga thanks - I should have been clearer; I'm able to get it using select - the issue is what happens after I get it. It seems like Jsoup modifies the HTML without offering a way to get the original code. I found an issue on the GitHub repo describing it as well, so I think it might not be possible. I'll update the question to include a link to the issue. – Matthew Clemente Feb 23 '17 at 11:00

1 Answer

The original HTML is invalid. An & which doesn't start a character reference must be expressed as &amp; in an HTML attribute value.

HTML parsers are expected to perform error recovery and generate a valid DOM.

Jsoup works by parsing the HTML into a DOM, letting you run queries on it, then exporting the DOM back to HTML afterwards.
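
To illustrate that round trip, here is a minimal sketch. It uses the JDK's built-in XML parser as a stand-in rather than Jsoup (an assumption, made only so the example needs no dependencies), and the class and method names are hypothetical. It suggests that &amp; in the serialized markup is just the escaped spelling: once the markup is parsed, the attribute value held in the DOM contains a plain &.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of Jsoup.
class AttributeRoundTrip {

    // Parses a single well-formed element and returns the decoded
    // value of the named attribute on the root element.
    static String decodedAttribute(String markup, String attribute) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return doc.getDocumentElement().getAttribute(attribute);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        // The escaped form, as it appears in serialized output.
        String serialized =
                "<img src=\"https://something.orother.com/i/cnt?txn_id=123&amp;p_id=123\"/>";
        // Prints https://something.orother.com/i/cnt?txn_id=123&p_id=123
        System.out.println(decodedAttribute(serialized, "src"));
    }
}
```

Any consumer that parses the output again sees the same URL, so the escaping only matters if the serialized text has to match the input character for character.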

You can't avoid white space normalisation, error recovery, or any of the other things that parsers do. The approach used by Jsoup to extract data is not designed to support the preservation of errors.
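
If the output really has to carry an unescaped & in src attributes, one possible workaround is to post-process the serialized string outside the parser. The sketch below is an assumption, not a Jsoup feature: the class and method names are hypothetical, it only handles double-quoted src attributes, and it would also turn a deliberately escaped sequence such as &amp;lt; into &lt;, so treat it as a starting point rather than a general solution.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing helper; not part of Jsoup.
class SrcAmpersandRestorer {

    // Matches a double-quoted src attribute preceded by whitespace
    // and captures its value in group 2.
    private static final Pattern SRC =
            Pattern.compile("(\\ssrc=\")([^\"]*)(\")");

    // Replaces &amp; with & inside src attribute values only;
    // all other markup is copied through unchanged.
    static String restore(String html) {
        Matcher m = SRC.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = m.group(2).replace("&amp;", "&");
            // quoteReplacement guards against '$' or '\' in the URL.
            m.appendReplacement(out,
                    Matcher.quoteReplacement(m.group(1) + value + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String serialized =
                "<img alt=\"a &amp; b\" src=\"https://something.orother.com/i/cnt?txn_id=123&amp;p_id=123\">";
        System.out.println(restore(serialized));
    }
}
```

Limiting the replacement to src values leaves escaped ampersands elsewhere (for example in alt text) untouched.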

Quentin
  • thanks, but it's my understanding that HTML5 relaxed this restriction. See: http://stackoverflow.com/a/19442133/5361034, which also cites the spec: https://www.w3.org/TR/html5/syntax.html#tokenizing-character-references – Matthew Clemente Feb 23 '17 at 15:21