Consider the following:

import com.jcabi.xml.XML;
import com.jcabi.xml.XMLDocument;

String s = "<tag>This has a &lt;a href=\"#\"&gt;link&lt;/a&gt;.</tag>";
final XML xml = new XMLDocument(s);
// The parser decodes entity references, so text() returns unescaped characters.
String extractedText = xml.xpath("//tag/text()").get(0);
System.out.println(extractedText); // Output: This has a <a href="#">link</a>.
System.out.println(s.contains(extractedText)); // Output: false!
System.out.println(s.contains("This has a &lt;a href=\"#\"&gt;link&lt;/a&gt;.")); // Output: true

I have an XML document given as a string that contains some escaped HTML. Using the jcabi library, I extract the text of the relevant elements (in this case, everything inside each <tag> element). However, what I get isn't what's in the original string: I expect &lt; and &gt; but get < and > instead. Paradoxically, the original string does not contain the substring that I extracted from it.

How can I get the actual text and not an unescaped version?

idlackage
  • Why do you expect the XML parser not to interpret entities like `&gt;`? Can you just [escape the string](http://stackoverflow.com/a/439494/223424) you receive from `.xpath()`? – 9000 Mar 01 '17 at 21:47
  • @9000 I can't use the escape functions due to them escaping the quotes also (") and possibly a bunch of other things that don't match the source. Said source is out of my control and comes with some things escaped and some things unescaped. I need to do replacements on it so I need the extraction to be in the exact format that I got it in. – idlackage Mar 01 '17 at 21:58
  • So, the source is in a non-canonical form, neither fully escaped nor fully unescaped. Bring it to the canonical form prior to doing anything with it. Stop considering it a text, for it's a serialized form of an XML document. Work with it as such, don't cling to the particular input form. – 9000 Mar 01 '17 at 22:00
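
For illustration, here is a minimal sketch of two workarounds touched on in the comments, using only the JDK so that it does not depend on jcabi. The class `RawTagText` and the helpers `rawInner` and `escapeMarkup` are hypothetical names introduced here, not part of any library. The first helper pulls the undecoded text straight out of the source string with a regular expression; it assumes the element has no attributes, is not nested inside itself, and contains no CDATA sections or comments. The second re-escapes only `&`, `<`, and `>` in the parser's output and leaves quotes alone; it reproduces the source only if the source escaped exactly those characters consistently, which may not hold for mixed input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RawTagText {

    // Hypothetical helper: returns the text between <name> and </name> exactly
    // as it appears in the source, without decoding entity references.
    // Assumes attribute-free, non-nested elements and no CDATA sections.
    static List<String> rawInner(String source, String name) {
        String quoted = Pattern.quote(name);
        Pattern pattern = Pattern.compile(
            "<" + quoted + ">(.*?)</" + quoted + ">", Pattern.DOTALL);
        Matcher matcher = pattern.matcher(source);
        List<String> result = new ArrayList<>();
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    // Hypothetical helper: re-escapes only the three markup characters.
    // The ampersand is replaced first so the entities produced by the later
    // replacements are not escaped a second time.
    static String escapeMarkup(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String s = "<tag>This has a &lt;a href=\"#\"&gt;link&lt;/a&gt;.</tag>";

        // Workaround 1: read the raw substring from the source string.
        String raw = rawInner(s, "tag").get(0);
        System.out.println(raw);
        System.out.println(s.contains(raw));

        // Workaround 2: re-escape the decoded text a parser would return.
        String decoded = "This has a <a href=\"#\">link</a>.";
        System.out.println(s.contains(escapeMarkup(decoded)));
    }
}
```

Both `contains` checks succeed for this particular input. If the source mixes escaped and unescaped characters, the first approach is the one that preserves the original bytes, while the second is only a best-effort reconstruction.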
