84

I am responding to an AJAX call by sending it an XML document through PHP echos. In order to form this XML document, I loop through the records of a database. The problem is that the database includes records that have '<' symbols in them. So naturally, the browser throws an error at that particular spot. How can this be fixed?

Scott C Wilson
JayD3e
  • Did you try creating a function that replaces all sensitive characters with their XML equivalents? Or maybe wrapping every value that may contain such characters in ""? – David Brunelle Aug 06 '10 at 17:17

7 Answers

111

Since PHP 5.4 you can use:

htmlspecialchars($string, ENT_XML1);

You should specify the encoding, such as:

htmlspecialchars($string, ENT_XML1, 'UTF-8');

Update

Note that the above will only convert:

  • & to &amp;
  • < to &lt;
  • > to &gt;

If you want to escape text for use in an attribute enclosed in double quotes:

htmlspecialchars($string, ENT_XML1 | ENT_COMPAT, 'UTF-8');

will convert " to &quot; in addition to &, < and >.


And if your attributes are enclosed in single quotes:

htmlspecialchars($string, ENT_XML1 | ENT_QUOTES, 'UTF-8');

will convert ' to &apos; in addition to &, <, > and ".

(Of course you can use this even outside of attributes).


See the manual entry for htmlspecialchars.
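As a minimal sketch of how the flags above might be used when echoing a record, assuming made-up data and two hypothetical helper names (xmlText and xmlAttr are not part of PHP):

```php
<?php
// Hypothetical helpers; the names xmlText and xmlAttr are assumptions for illustration.
function xmlText(string $value): string
{
    // Escapes &, < and > for use as element content.
    return htmlspecialchars($value, ENT_XML1, 'UTF-8');
}

function xmlAttr(string $value): string
{
    // ENT_QUOTES additionally escapes " and ', so either quoting style is safe.
    return htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

// Made-up record standing in for a database row.
$row = ['id' => 'a"1', 'note' => '5 < 6 & 6 > 5'];

$xml = '<record id="' . xmlAttr($row['id']) . '">' . xmlText($row['note']) . '</record>';
echo $xml, "\n";
// prints: <record id="a&quot;1">5 &lt; 6 &amp; 6 &gt; 5</record>
```

Keeping the text and attribute cases in separate helpers makes it harder to forget the quote flag when a value ends up inside an attribute.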

Sébastien
  • htmlspecialchars($string, ENT_XML1, 'UTF-8') worked well for me. I actually apply it to all of them, just for safety – Miguel Sep 16 '15 at 18:33
  • In cases where you are formatting a string for SimpleXML that needs to be XML validated this seems to be the cleanest working solution. I am dealing with lots of special characters being used and this solved my issues. – Ryan Rentfro Oct 22 '15 at 19:05
  • `htmlspecialchars` does not escape `\xB` (vertical tab) for instance, which is [invalid XML](https://stackoverflow.com/q/14192135/2683737). – Rainer Rillke May 22 '20 at 11:01
69

By either escaping those characters with htmlspecialchars, or, perhaps more appropriately, using a library for building XML documents, such as DOMDocument or XMLWriter.

Another alternative would be to use CDATA sections, but then you'd have to look out for occurrences of ]]>.

Also take into consideration that you must respect the encoding you define for the XML document (by default UTF-8).
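For the library route, here is a minimal sketch using XMLWriter. The rows and the element names (records, record) are made up for illustration; in the question's setting they would come from the database loop.

```php
<?php
// Made-up rows standing in for the database records.
$rows = [
    ['id' => 1, 'note' => '5 < 6 & 6 > 5'],
    ['id' => 2, 'note' => 'He said "hi"'],
];

$writer = new XMLWriter();
$writer->openMemory();
$writer->startDocument('1.0', 'UTF-8');
$writer->startElement('records');              // hypothetical root element name
foreach ($rows as $row) {
    $writer->startElement('record');           // hypothetical element name
    $writer->writeAttribute('id', (string) $row['id']);
    $writer->text($row['note']);               // text() escapes markup characters
    $writer->endElement();
}
$writer->endElement();
$writer->endDocument();

$xml = $writer->outputMemory();
// In a web response you would typically send a Content-Type header first.
echo $xml;
```

The point of the design is that the raw values are never concatenated into markup by hand, so there is no place to forget an escape call.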

Artefacto
  • htmlspecialchars isn't the best way of doing it, because as the name suggests it's meant for HTML output, not XML. It will, for example, convert < to &lt;, when for XML the correct encoding is &amp;lt;. DOMDocument, SimpleXML or similar XML-aware extensions would be a better bet. – GordonM Jan 07 '11 at 12:48
  • @Gordon Hum? Since when is `&lt;` not correct for XML? `htmlspecialchars` actually only does entity substitution with entities that are guaranteed to be available for *any* XML document, and even leaves one behind (replaces `'` with `&#039;` when it could use `&apos;`; of course, `&#039;` is correct too). – Artefacto Jan 08 '11 at 00:13
  • @Gordon By the way, there are *some* reasons why `htmlspecialchars` may be insufficient for XML (namely, it doesn't replace forbidden characters in XML and it doesn't encode forbidden entities when $double_encode is TRUE) -- which, btw, I have addressed by introducing profiles in trunk's version of htmlspecialchars/entities --, but what you say is simply not true. What you're describing is a double encoding; you need `&amp;lt;` in XML in the same circumstances you'd need it in HTML -- when you need to represent `&lt;`. – Artefacto Jan 08 '11 at 00:15
  • Not sure if &lt; is the best example, but it is a very real problem with htmlspecialchars. It's fundamentally intended for HTML escaping, not XML. PHP provides better tools for the job than htmlspecialchars, and those should be used instead. – GordonM Jan 08 '11 at 16:19
  • I have an issue trying to insert strings with pound signs (£) in the data, and htmlentities does not work. I do not think this is the correct answer, unless for some reason I'm doing something wrong. The string htmlentities returns is not accepted by the DOMDocument::loadXML function. Any other suggestions? – Ninjanoel Jul 04 '13 at 16:21
  • Does the built-in Soap Client also cleanse strings like DOMDocument? – Scott Oct 07 '16 at 14:52
  • `or using a library for building XML documents, such as DOMDocument` it doesn't help – Vasilii Suricov Feb 26 '20 at 15:21
12

1) You can wrap your text as CDATA like this:

<mytag>
    <![CDATA[Your text goes here. Btw: 5<6 and 6>5]]>
</mytag>

see http://www.w3schools.com/xml/xml_cdata.asp

2) As someone already said: escape those chars, e.g. like so:

5&lt;6 and 6&gt;5
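If you go the CDATA route from PHP, the one sequence that needs care is ]]>, which would end the section early. A common workaround is to split it across two CDATA sections. A minimal sketch, where the helper name cdata is an assumption for illustration:

```php
<?php
// Hypothetical helper name; splits any "]]>" so it cannot close the section early.
function cdata(string $value): string
{
    // "]]>" becomes "]]" at the end of one section and ">" at the start of the next.
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $value) . ']]>';
}

$text = '5<6 and 6>5, plus a stray ]]> marker';
$xml  = '<mytag>' . cdata($text) . '</mytag>';
echo $xml, "\n";
// prints: <mytag><![CDATA[5<6 and 6>5, plus a stray ]]]]><![CDATA[> marker]]></mytag>
```

A parser concatenates the adjacent sections, so the element's text content comes back as the original string.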
Elvith
  • *oops* I overlooked that CDATA was already mentioned in the previous answer – Elvith Aug 06 '10 at 17:21
  • You made it very clear what I needed to do, so I appreciate that, regardless of whether it was already mentioned. I ended up using your solution for a quick fix, but the best practice would probably be to use XMLWriter as Artefacto mentioned, so I'm giving the best answer to him. – JayD3e Aug 06 '10 at 17:29
  • +1 for CDATA (but be careful, XML parsers can be set up to leave CDATA blocks out of the parsed tree) – GordonM Feb 13 '13 at 17:12
7

Try this:

$str = htmlentities($str,ENT_QUOTES,'UTF-8');

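So, after filtering your data using the htmlentities() function, you can use the data in an XML tag as shown below. Note that without the ENT_XML1 flag, htmlentities() also emits HTML-only named entities such as &pound;, which an XML parser rejects unless they are declared: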

<mytag>$str</mytag>
DontVoteMeDown
Mosiur
6

If at all possible, it's always a good idea to create your XML using the XML classes rather than string manipulation. One of the benefits is that the classes will automatically escape characters as needed.
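A minimal sketch of that approach with DOMDocument, using made-up element names and values. One caveat worth knowing: the optional value argument of createElement() is not escaped, so the sketch adds text through createTextNode() instead.

```php
<?php
$doc  = new DOMDocument('1.0', 'UTF-8');
$root = $doc->appendChild($doc->createElement('records'));   // hypothetical root name

$record = $doc->createElement('record');                     // hypothetical element name
// Attribute values are escaped when the document is serialized.
$record->setAttribute('label', 'a "quoted" <value>');
// createTextNode() takes raw text; serialization escapes &, < and >.
$record->appendChild($doc->createTextNode('5 < 6 & 6 > 5'));
$root->appendChild($record);

$xml = $doc->saveXML();
echo $xml;
```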

Ed Schembor
5

Adding this in case it helps someone.

As I am working with Japanese characters, encoding has also been set appropriately. However, from time to time, I find that htmlentities and htmlspecialchars are not sufficient.

Some user inputs contain special characters that are not stripped by the above functions. In those cases I have to do this:

preg_replace('/[\x00-\x1f]/','',htmlspecialchars($string))

This also removes XML-unsafe control characters such as the null character or EOT. Note that the range \x00-\x1f also covers tab, line feed and carriage return, which XML does allow. You can use this table to determine which characters you wish to omit.
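A narrower variant, sketched here under the assumption that the input is UTF-8, removes only the code points that XML 1.0 forbids and keeps tab, line feed and carriage return. The function name stripInvalidXmlChars is hypothetical:

```php
<?php
// Hypothetical helper name. Removes code points outside the XML 1.0 Char
// production while keeping tab (0x09), line feed (0x0A) and carriage return (0x0D).
function stripInvalidXmlChars(string $value): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $value
    );
    if ($clean === null) {
        // preg_replace() returns null when the subject is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $clean;
}

// Vertical tab and NUL are dropped, the tab survives, "<" is escaped afterwards.
$raw  = "a\x0Bb\x00c\td<e";
$safe = htmlspecialchars(stripInvalidXmlChars($raw), ENT_XML1, 'UTF-8');
```

Stripping before escaping keeps the two concerns separate: the first step decides which characters may appear at all, the second handles markup characters.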

Reuben L.
0

I prefer the way Golang does quote escaping for XML (and a few extras like newline escaping, and escaping some other characters), so I have ported its XML escape function to PHP below

function isInCharacterRange(int $r): bool {
    return $r == 0x09 ||
            $r == 0x0A ||
            $r == 0x0D ||
            $r >= 0x20 && $r <= 0xD7FF ||
            $r >= 0xE000 && $r <= 0xFFFD ||
            $r >= 0x10000 && $r <= 0x10FFFF;
}

function xml(string $s, bool $escapeNewline = true): string {
    $w = '';

    $Last = 0;
    $l = strlen($s);
    $i = 0;

    while ($i < $l) {
        $r = mb_substr(substr($s, $i), 0, 1);
        $Width = strlen($r);
        $i += $Width;
        switch ($r) {
            case '"':
                $esc = '&#34;';
                break;
            case "'":
                $esc = '&#39;';
                break;
            case '&':
                $esc = '&amp;';
                break;
            case '<':
                $esc = '&lt;';
                break;
            case '>':
                $esc = '&gt;';
                break;
            case "\t":
                $esc = '&#x9;';
                break;
            case "\n":
                if (!$escapeNewline) {
                    continue 2;
                }
                $esc = '&#xA;';
                break;
            case "\r":
                $esc = '&#xD;';
                break;
            default:
                if (!isInCharacterRange(mb_ord($r)) || (mb_ord($r) === 0xFFFD && $Width === 1)) {
                    $esc = "\u{FFFD}";
                    break;
                }

                continue 2;
        }
        $w .= substr($s, $Last, $i - $Last - $Width) . $esc;
        $Last = $i;
    }
    $w .= substr($s, $Last);
    return $w;
}

Note that you'll need at least PHP 7.2 because of the mb_ord usage, or you'll have to swap it out for a polyfill, but these functions are working great for us!

For anyone curious, here is the relevant Go source https://golang.org/src/encoding/xml/xml.go?s=44219:44263#L1887

Brian Leishman