9

I am trying to write a link that contains the & sign into an XML file using DOMDocument. When I try this, the & becomes &amp; in the XML, so product=1&qty=1 becomes product=1&amp;qty=1.

Can you please tell me a way to avoid this?

Makoto
Catalin
  • That is valid XML -- thank your XML tool. `&qty;` is invalid XML. Thank your XML tool some more. To see it "display correctly" (as just an `&`), use an XML-aware editor; i.e. not notepad. It has *nothing inherently to do with a URI*, just XML. –  Jun 16 '11 at 22:29
  • Why would you want to avoid behaviour that is perfectly correct and replace it with something incorrect? In XML, ampersand must always be escaped, even in a URL. You don't need to worry about it, the XML parser will unescape it at the other end. – Michael Kay Jun 17 '11 at 10:07

2 Answers

9

As Gordon said, URIs are encoded this way. If the & were not encoded as &amp;, the XML file would be malformed and you would get errors parsing it. When you take the string back out of the XML file, if the &amp; still shows up, either use str_replace() like this:

$str = str_replace('&amp;', '&', $str);

Or use htmlspecialchars_decode():

$str = htmlspecialchars_decode($str);

The added bonus of using htmlspecialchars_decode() is that it also decodes the other special-character entities that might be in the string, such as &lt;, &gt; and &quot;. For more, see the PHP manual page for htmlspecialchars_decode().
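In most cases no manual decoding should be needed, because the XML parser unescapes the text when it reads the file back. Here is a minimal sketch of that round trip; the example.com URL and the `link` element name are made up for illustration, and only the query-string shape comes from the question.

```php
<?php
// Hypothetical URL; only "product=1&qty=1" is taken from the question.
$url = 'http://example.com/cart?product=1&qty=1';

// Write: put the URL into a text node and let DOMDocument escape it.
$doc  = new DOMDocument('1.0', 'UTF-8');
$link = $doc->createElement('link');
$link->appendChild($doc->createTextNode($url));
$doc->appendChild($link);

// The serialized XML contains "product=1&amp;qty=1".
$xml = $doc->saveXML();

// Read: parse the XML again and take the text content back out.
$parsed = new DOMDocument();
$parsed->loadXML($xml);

// The parser has already turned "&amp;" back into "&".
$roundTripped = $parsed->documentElement->textContent;

echo $xml;
echo $roundTripped, "\n";
```

The sketch uses createTextNode() rather than the second argument of createElement(), because the PHP manual notes that the value passed to createElement() is not escaped.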

Bojangles
  • It has nothing to do with the URI -- and everything to do with valid XML syntax, as noted by Gordon. The XML library being used (thankfully) ensures that valid XML is being produced (in this case). –  Jun 16 '11 at 22:32
6

Ampersands should be encoded like this. Changing it would be wrong.

See http://www.w3.org/TR/xml/

The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings &amp; and &lt; respectively.

and http://www.w3.org/TR/xhtml1/#C_12

In both SGML and XML, the ampersand character ("&") declares the beginning of an entity reference (e.g., &reg; for the registered trademark symbol "®"). Unfortunately, many HTML user agents have silently ignored incorrect usage of the ampersand character in HTML documents - treating ampersands that do not look like entity references as literal ampersands. XML-based user agents will not tolerate this incorrect usage, and any document that uses an ampersand incorrectly will not be "valid", and consequently will not conform to this specification. In order to ensure that documents are compatible with historical HTML user agents and XML-based user agents, ampersands used in a document that are to be treated as literal characters must be expressed themselves as an entity reference (e.g. "&amp;"). For example, when the href attribute of the a element refers to a CGI script that takes parameters, it must be expressed as http://my.site.dom/cgi-bin/myscript.pl?class=guest&amp;name=user rather than as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user
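To see why the escaping matters, the rough sketch below feeds a parser one string with a bare & and one with &amp;. The `link` element and the example.com URL are assumptions for illustration, not part of the original question.

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical documents: the first has a bare "&", the second escapes it.
$bad  = '<link>http://example.com/cart?product=1&qty=1</link>';
$good = '<link>http://example.com/cart?product=1&amp;qty=1</link>';

// The bare ampersand starts an entity reference that is never terminated,
// so the document is not well-formed and loading is expected to fail.
$badDoc    = new DOMDocument();
$badLoaded = $badDoc->loadXML($bad);
$errors    = libxml_get_errors();
libxml_clear_errors();

// The escaped version parses, and the text content holds a plain "&" again.
$goodDoc    = new DOMDocument();
$goodLoaded = $goodDoc->loadXML($good);
$value      = $goodLoaded ? $goodDoc->documentElement->textContent : null;

var_dump($badLoaded, count($errors), $goodLoaded, $value);
```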

hakre
Gordon
  • Thank you for the answer. I just realized that the link works also like this. – Catalin Jun 16 '11 at 22:23
  • This starts out wrong. It has nothing to do with the URI -- and everything to do with valid XML syntax, as noted below. The fact that the examples used are URIs is irrelevant as it might very well have been "Ben & Jerry". (+1 for the rest, including links and excerpts.) –  Jun 16 '11 at 22:31
  • @pst true, I could improve that. will do. – Gordon Jun 16 '11 at 22:41