
I am generating XML using PHP's DOM library as below:

$dom = new DOMDocument("1.0","utf-8");

Doing the above results in a page that shows an error message at the top of the output:

This page contains the following errors:
error on line 16 at column 274505: PCDATA invalid Char value 27
Below is a rendering of the page up to the first error.

I have tried rectifying this with the Tidy library, and I used iconv to convert the Chinese characters to UTF-8.

j0k
Prashant

2 Answers


A useful function to get rid of that error is suggested on this website: http://www.phpwact.org/php/i18n/charsets#common_problem_areas_with_utf-8

When you put UTF-8 encoded strings in an XML document, remember that not all valid UTF-8 characters are accepted in an XML document: http://www.w3.org/TR/REC-xml/#charsets

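So you should strip away the unwanted characters, or else you'll get a fatal XML parsing error such as the one above. In that error, char value 27 is the ESC control character (0x1B), which XML 1.0 does not allow.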

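// Replace each run of characters outside the listed ranges with a single space.
// Kept: tab, LF, CR, U+0020-U+D7FF and U+E000-U+FFFD.
// Note: XML 1.0 also allows U+10000-U+10FFFF; append \x{10000}-\x{10FFFF} to the class to keep those.
function utf8_for_xml($string)
{
    return preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $string);
}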
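
As a rough sketch of how the function might be wired into the DOMDocument code from the question: the element names `root` and `item` and the sample string are made up for illustration, and the function is repeated only so the snippet runs on its own.

```php
<?php
// Same pattern as utf8_for_xml() above, repeated so this snippet is self-contained.
function utf8_for_xml($string)
{
    return preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $string);
}

// Hypothetical input containing an ESC character (char value 27).
$raw = "Hello\x1bWorld";

$dom  = new DOMDocument("1.0", "utf-8");
$root = $dom->createElement('root');   // element names are illustrative assumptions
$dom->appendChild($root);

$item = $dom->createElement('item');
// Sanitise the text at the moment it is added to the document.
$item->appendChild($dom->createTextNode(utf8_for_xml($raw)));
$root->appendChild($item);

$xml = $dom->saveXML();
echo $xml;
```

The idea is to clean each string when it is added as text, rather than trying to repair the serialized XML afterwards.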

Hope that saves someone else some time.

j0k
Prashant
  • Thank you very much. I am quite surprised that php xml writer does not do these things itself. – Michal Sep 01 '16 at 17:43
  • Here is an equivalent sanitisation function in **Ruby**, in case anyone finds it useful: `string.gsub(/[^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}]+/u, ' ')` ... Or, more efficiently, this can also be achieved with: `string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')` – Tom Lord Nov 17 '16 at 09:43
  • Thank you so much Prashant!! – ijpatricio Nov 22 '16 at 22:03
  • This is awesome. I see that I have liked this already. I want to give you another like. – Michal Jan 31 '17 at 17:18
  • I wasted 2 days because of this. Thank you very much! – Supun Kavinda Oct 22 '20 at 16:07
  • For me, this function returns NULL. Possibly because the input is not UTF-8. Not sure what the input is... – Wouter May 25 '21 at 15:30
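
On the NULL return mentioned in the last comment: with the /u modifier, preg_replace() returns NULL when the subject is not valid UTF-8. A minimal sketch of one way to guard against that follows. The helper name `utf8_for_xml_safe` and the ISO-8859-1 fallback are assumptions, so substitute whatever encoding your input really uses. It relies on the mbstring extension.

```php
<?php
// Hypothetical helper: the name and the default fallback encoding are assumptions.
function utf8_for_xml_safe(string $string, string $fallbackEncoding = 'ISO-8859-1'): string
{
    // Convert input that is not valid UTF-8 before running the /u pattern on it.
    if (!mb_check_encoding($string, 'UTF-8')) {
        $string = mb_convert_encoding($string, 'UTF-8', $fallbackEncoding);
    }

    // Same character class as utf8_for_xml() in the answer above.
    $clean = preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $string);

    // Surface a failure instead of silently passing NULL along.
    if ($clean === null) {
        throw new RuntimeException('preg_replace failed with error code ' . preg_last_error());
    }

    return $clean;
}
```

Guessing the source encoding is the weak point here: if the input is not actually ISO-8859-1, the conversion will produce wrong characters, so it is better to pass the real encoding when it is known.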

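Prashant is absolutely right. You can also strip away invalid characters in JavaScript by doing:

// Removes (rather than replaces) every UTF-16 code unit outside the listed ranges.
// Unlike the PHP version, this also drops the noncharacters U+FDD0-U+FDDF.
function utf8_for_xml(inputStr) {
  return inputStr.replace(/[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/g, '');
}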
Quang Tran