I want to write a string into an XML node, but I have to strip any forbidden characters before doing so. I found that the following snippet works:

preg_replace("/[^\\x0009\\x000A\\x000D\\x0020-\\xD7FF\\xE000-\\xFFFD]/", "", $var)

However, it removes a lot of characters that I want to keep, such as space, ;, &, <, >, \, and /.

I did some searching and found that space is x0020, so I first tried to allow spaces by changing the code above to:

preg_replace("/[^\\x0009\\x000A\\x000D\\x0021-\\xD7FF\\xE000-\\xFFFD]/", "", $var)

but it still removes spaces. I just want to remove those weird hidden "command" characters. How can I do that?

EDIT: I previously ran $var through htmlspecialchars(), which is why I want to keep & and ;

Chris

2 Answers

You don't have to strip them.

If you use an XML API like DOM or XMLWriter it will encode the special characters into entities:

$document = new DOMDocument('1.0', 'UTF-8');
$document
  ->appendChild($document->createElement('foo'))
  ->appendChild($document->createTextNode("\x09\x0A\x0D\x20 ä ç <&>"));

echo $document->saveXml();

Output:

<?xml version="1.0" encoding="UTF-8"?>
<foo>   
&#13;  ä ç &lt;&amp;&gt;</foo>

The XML parser will decode them again:

$xml = $document->saveXml(); // serialized output of the previous snippet

$document = new DOMDocument('1.0', 'UTF-8');
$document->loadXml($xml);

var_dump($document->documentElement->textContent);

Output:

string(14) "    

  ä ç <&>"
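
The XMLWriter route mentioned above might look like the following minimal sketch. It assumes the text passed in is the raw string, not one already run through htmlspecialchars(), since escaping twice would turn & into &amp;amp;. The element name foo is carried over from the DOM example.

```php
<?php
// Build a small document in memory with XMLWriter.
$writer = new XMLWriter();
$writer->openMemory();
$writer->startDocument('1.0', 'UTF-8');
$writer->startElement('foo');

// text() escapes markup characters such as <, & and > for us.
$writer->text("ä ç <&>");

$writer->endElement();
$writer->endDocument();

// Fetch the serialized XML from the in-memory buffer.
$xml = $writer->outputMemory();
echo $xml;
```

Reading $xml back with DOMDocument::loadXML() should give the original text again, as in the DOM example.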
ThW

Do you need to add a "u" to the end of your regex, so PHP knows you want Unicode matching? See also "UTF-8 in PHP regular expressions".
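
As a sketch of what that could look like: without braces, PCRE reads at most two hex digits after \x, so \x0020 is taken as \x00 followed by the literal characters "20", and the ranges in the question's pattern are not the ones intended. With the u modifier, code points are written as \x{...}. The function name below is hypothetical, and the extra range \x{10000}-\x{10FFFF} is an addition taken from the XML 1.0 definition of allowed characters, not from the question.

```php
<?php
// Hypothetical helper: removes characters that XML 1.0 does not allow.
function strip_invalid_xml_chars(string $var): string
{
    // Single quotes keep the backslashes intact for PCRE;
    // the trailing "u" makes the pattern and subject UTF-8 aware.
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $var
    );

    // preg_replace() returns null on error, e.g. if $var is not valid UTF-8.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $clean;
}

// Control characters \x01 and \x0B are dropped; space, &, ;, <, >, \ and / stay.
var_dump(strip_invalid_xml_chars("a\x01b c &amp; <\x0B> \\ /"));
```

Because the u modifier makes preg_replace() fail on malformed UTF-8, the null check matters if the input encoding is not guaranteed.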

I also wonder if you might want to replace those characters with spaces rather than with nothing. It depends on what you're doing, but since newlines are being dropped too, words could end up joined together across lines.

TextGeek