How to skip invalid characters in XML file using PHP

Question

I'm trying to parse an XML file using PHP, but I get an error message:

parser error : Char 0x0 out of allowed range in

I think it's because of the content of the XML; I think there is a special symbol "☆". Any ideas what I can do to fix it?

I also get:

parser error : Premature end of data in tag item line

What might be causing that error?

I'm using simplexml_load_file.

Update:

I tried to isolate the line with the error and pasted its content into a separate XML file, and that parses fine! So I still can't figure out what makes the XML file fail to parse. PS: it's a huge XML file, over 100 MB. Could that cause the parse error?

score 42 · Accepted Answer · edited Jul 06 '21 at 09:18

Do you have control over the XML? If so, ensure the data is enclosed in <![CDATA[ .. ]]> blocks.

And you also need to clear the invalid characters:

/**
 * Removes invalid XML
 *
 * @access public
 * @param string $value
 * @return string
 */
function stripInvalidXml($value)
{
    $ret = "";
    if (empty($value))
    {
        return $ret;
    }
 
    $length = strlen($value);
    for ($i=0; $i < $length; $i++)
    {
        $current = ord($value[$i]);
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x20) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            $ret .= chr($current);
        }
        else
        {
            $ret .= " ";
        }
    }
    return $ret;
}
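
For comparison, here is a minimal sketch of the same filter that works on Unicode code points instead of single bytes. It assumes the input is already valid UTF-8 and that PCRE has Unicode support; the name `stripInvalidXmlCodePoints` is made up for this example. The allowed ranges are the same ones listed in the function above.

```php
<?php
/**
 * Replaces every code point that XML 1.0 does not allow with a space.
 * Hypothetical helper; assumes $value is valid UTF-8.
 */
function stripInvalidXmlCodePoints(string $value): string
{
    // Negated class of the allowed ranges. The /u modifier makes PCRE
    // match whole code points rather than single bytes.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    $result = preg_replace($pattern, ' ', $value);
    if ($result === null) {
        // preg_replace() returns null on error, e.g. malformed UTF-8 input.
        throw new InvalidArgumentException('preg_replace failed, error code ' . preg_last_error());
    }
    return $result;
}

// The NUL byte becomes a space; the multi-byte star is left alone.
var_dump(stripInvalidXmlCodePoints("a" . "\x00" . "b☆") === "a b☆");
```

Because the pattern runs in Unicode mode, a multi-byte character such as "☆" is treated as one code point and is never split into its bytes.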

edited by marcovtwout · answered Aug 12 '10 at 08:54 by Jhong

I can't control the XML, but I can ask... but is that the solution?? Let me check it – user315396 Aug 12 '10 at 08:56
Not sure if this would help in this case. You can't fix encoding problems with CDATA, only escaping issues like "&" instead of "&amp;". – Matthew Wilson Aug 12 '10 at 09:57
Yes, I agree. Dominic has the solution. – Jhong Aug 12 '10 at 10:07
user315396: Sorry but no way you've fixed "out of allowed range" with a CData section. – chendral Aug 13 '10 at 08:03
Hi, I'm having problems making this function work for me. In my particular case, I'm receiving the string from a $_GET variable, utf8_encoded like 'Canci%EF%BF%BDn' and I need to remove the '%EF%BF%BD' part. See: http://codepad.viper-7.com/NJoPRG – Cesar Aug 29 '12 at 23:22
@Jhong thank you very much for this function, it saved me from a lot of pain. But I want to know more about these 'illegal' characters: what they are and, especially, whether there are MORE like them that can break my XML processing using DOMDocument and lead to incomplete results... Do you know where this is explained? – Miloš Đakonović Jun 10 '13 at 09:58
This function is broken. `ord()` is only operating on single-bytes. – hakre Jun 29 '13 at 00:12
FWIW, if you're getting an error loading an XML string into DOMDocument::loadXML(), `return utf8_encode($ret)` will also encode it as UTF-8, after removing the invalid characters. – Curtis Mattoon Mar 18 '15 at 16:19
This worked for me. I was having the "Char 0x0 out of allowed range in Entity" error while calling DOMDocument::loadXML(): – Ledazinha Aug 09 '22 at 16:03

mikeytown2 · Answer 2 · 2014-04-16T22:04:20.637

I decided to test all UTF-8 values (0-1114111) to make sure things work as they should. Using preg_replace() causes a NULL to be returned due to errors when testing all UTF-8 values. This is the solution I've come up with.

$utf_8_range = range(0, 1114111);
$output = ords_to_utfstring($utf_8_range);
$sanitized = sanitize_for_xml($output);


/**
 * Removes invalid XML
 *
 * @access public
 * @param string $input
 * @return string
 */
function sanitize_for_xml($input) {
  // Convert input to UTF-8.
  $old_setting = ini_set('mbstring.substitute_character', 'none');
  $input = mb_convert_encoding($input, 'UTF-8', 'auto');
  ini_set('mbstring.substitute_character', $old_setting);

  // Use fast preg_replace. If failure, use slower chr => int => chr conversion.
  // The class also keeps U+10000-U+10FFFF, which XML 1.0 allows.
  $output = preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u', '', $input);
  if (is_null($output)) {
    // Convert to ints.
    // Convert ints back into a string.
    $output = ords_to_utfstring(utfstring_to_ords($input), TRUE);
  }
  return $output;
}

/**
 * Given a UTF-8 string, output an array of ordinal values.
 *
 * @param string $input
 *   UTF-8 string.
 * @param string $encoding
 *   Defaults to UTF-8.
 *
 * @return array
 *   Array of ordinal values representing the input string.
 */
function utfstring_to_ords($input, $encoding = 'UTF-8'){
  // Turn a string of unicode characters into UCS-4BE, which is a Unicode
  // encoding that stores each character as a 4 byte integer. This accounts for
  // the "UCS-4"; the "BE" suffix indicates that the integers are stored in
  // big-endian order. The reason for this encoding is that each character is a
  // fixed size, making iterating over the string simpler.
  $input = mb_convert_encoding($input, "UCS-4BE", $encoding);

  // Visit each unicode character.
  $ords = array();
  for ($i = 0; $i < mb_strlen($input, "UCS-4BE"); $i++) {
    // Now we have 4 bytes. Find their total numeric value.
    $s2 = mb_substr($input, $i, 1, "UCS-4BE");
    $val = unpack("N", $s2);
    $ords[] = $val[1];
  }
  return $ords;
}

/**
 * Given an array of ints representing Unicode chars, outputs a UTF-8 string.
 *
 * @param array $ords
 *   Array of integers representing Unicode characters.
 * @param bool $scrub_XML
 *   Set to TRUE to remove non valid XML characters.
 *
 * @return string
 *   UTF-8 String.
 */
function ords_to_utfstring($ords, $scrub_XML = FALSE) {
  $output = '';
  foreach ($ords as $ord) {
    // 0: Negative numbers.
    // 55296 - 57343: Surrogate Range.
    // 65279: BOM (byte order mark).
    // 1114111: Out of range.
    if (   $ord < 0
        || ($ord >= 0xD800 && $ord <= 0xDFFF)
        || $ord == 0xFEFF
        || $ord > 0x10ffff) {
      // Skip non valid UTF-8 values.
      continue;
    }
    // 9: Anything Below 9.
    // 11: Vertical Tab.
    // 12: Form Feed.
    // 14-31: Unprintable control codes.
    // 65534, 65535: Unicode noncharacters.
    elseif ($scrub_XML && (
               $ord < 0x9
            || $ord == 0xB
            || $ord == 0xC
            || ($ord > 0xD && $ord < 0x20)
            || $ord == 0xFFFE
            || $ord == 0xFFFF
            )) {
      // Skip non valid XML values.
      continue;
    }
    // 127: 1 Byte char.
    elseif ( $ord <= 0x007f) {
      $output .= chr($ord);
      continue;
    }
    // 2047: 2 Byte char.
    elseif ($ord <= 0x07ff) {
      $output .= chr(0xc0 | ($ord >> 6));
      $output .= chr(0x80 | ($ord & 0x003f));
      continue;
    }
    // 65535: 3 Byte char.
    elseif ($ord <= 0xffff) {
      $output .= chr(0xe0 | ($ord >> 12));
      $output .= chr(0x80 | (($ord >> 6) & 0x003f));
      $output .= chr(0x80 | ($ord & 0x003f));
      continue;
    }
    // 1114111: 4 Byte char.
    elseif ($ord <= 0x10ffff) {
      $output .= chr(0xf0 | ($ord >> 18));
      $output .= chr(0x80 | (($ord >> 12) & 0x3f));
      $output .= chr(0x80 | (($ord >> 6) & 0x3f));
      $output .= chr(0x80 | ($ord & 0x3f));
      continue;
    }
  }
  return $output;
}

And to do this on a simple object or array:

// Recursive sanitize_for_xml.
function recursive_sanitize_for_xml(&$input){
  if (is_null($input) || is_bool($input) || is_numeric($input)) {
    return;
  }
  if (!is_array($input) && !is_object($input)) {
    $input = sanitize_for_xml($input);
  }
  else {
    foreach ($input as &$value) {
      recursive_sanitize_for_xml($value);
    }
  }
}

This has been very helpful! Though I have yet to find an input that triggers the 'long' version. Do you have any examples of when it's needed? – mcfedr, Nov 07 '17 at 12:31
@mcfedr preg_replace may have been fixed in later versions of PHP. This was PHP 5.4, I believe – mikeytown2, Nov 07 '17 at 14:03

nwellnhof · Answer 3 · 2021-02-17T21:54:04.763

Certain Unicode characters must not appear in XML 1.0:

C0 control codes (U+0000 - U+001F) except tab, CR and LF.
UTF-16 surrogates (U+D800 - U+DFFF). These are invalid in UTF-8 as well and indicate more serious problems when encountered.
U+FFFE and U+FFFF.

But in practice, you often have to handle XML which was carelessly produced from other sources containing such characters. If you want to handle this special case of invalid XML in a UTF-8 encoded string, I'd suggest:

$str = preg_replace(
    '/[\x00-\x08\x0B\x0C\x0E-\x1F]|\xED[\xA0-\xBF].|\xEF\xBF[\xBE\xBF]/',
    "\xEF\xBF\xBD",
    $str
);

This doesn't use the u Unicode regex modifier but works directly on UTF-8 encoded bytes for extra performance. The parts of the pattern are:

Invalid control chars: [\x00-\x08\x0B\x0C\x0E-\x1F]
UTF-16 surrogates: \xED[\xA0-\xBF].
Non-characters U+FFFE and U+FFFF: \xEF\xBF[\xBE\xBF]

Invalid characters are replaced with the replacement character U+FFFD (�) instead of simply stripping them. This makes it easier to diagnose invalid chars and can even prevent security issues.
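
A minimal usage sketch of the pattern above, wrapped in a function and followed by a parse. The name `replaceInvalidXmlBytes` and the sample document are invented for illustration, and the input is assumed to be UTF-8 apart from the offending bytes.

```php
<?php
// Hypothetical wrapper around the byte-level pattern shown above.
function replaceInvalidXmlBytes(string $str): string
{
    $out = preg_replace(
        '/[\x00-\x08\x0B\x0C\x0E-\x1F]|\xED[\xA0-\xBF].|\xEF\xBF[\xBE\xBF]/',
        "\xEF\xBF\xBD", // UTF-8 encoding of U+FFFD
        $str
    );
    if ($out === null) {
        throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
    }
    return $out;
}

// Sample document with a NUL byte inside an element.
$dirty = "<items><item>A" . "\x00" . "B</item></items>";

$clean = replaceInvalidXmlBytes($dirty);
$xml = simplexml_load_string($clean);

// The NUL has been swapped for U+FFFD, so the document parses.
var_dump($xml !== false);
```

The replacement character survives in the parsed text, so the position of the original bad byte stays visible afterwards.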

score 2 · Answer 4 · edited Apr 24 '16 at 05:51

If you have control over the data, ensure that it is encoded correctly, i.e. that it is in the encoding you promised in the XML declaration. For example, if you have:

<?xml version="1.0" encoding="UTF-8"?>

then you'll need to ensure your data is in UTF-8.
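
A rough sketch of that check in PHP, assuming the mbstring extension is available. The function name `ensureUtf8` is hypothetical, and the source encoding is something the caller has to know or guess; ISO-8859-1 is used here purely as an example.

```php
<?php
/**
 * Returns $data as UTF-8. If it is not already valid UTF-8, it is converted
 * from $assumedSource, which is a guess the caller has to supply.
 */
function ensureUtf8(string $data, string $assumedSource): string
{
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data; // already valid, leave untouched
    }
    return mb_convert_encoding($data, 'UTF-8', $assumedSource);
}

// "café" with the last letter stored as a single ISO-8859-1 byte.
$latin1 = "caf" . "\xE9";
$utf8 = ensureUtf8($latin1, 'ISO-8859-1');

var_dump(mb_check_encoding($utf8, 'UTF-8'));
```

Note that this only repairs a wrong encoding. Control characters such as 0x0 are valid UTF-8 and still have to be removed separately.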

If you don't have control over the data, yell at those who do.

You can use a tool like xmllint to check which part(s) of the data are not valid.
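
If you would rather locate the problem from PHP itself, libxml can report the position of each parse error. This is only a sketch: the sample string stands in for the real file, and with a file on disk you would call simplexml_load_file() as in the question.

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Stand-in for the real data; the vertical tab (0x0B) is not allowed in XML 1.0.
$data = "<items>\n<item>ok</item>\n<item>bad" . "\x0B" . "</item>\n</items>";

$xml = simplexml_load_string($data);
$errors = libxml_get_errors();
libxml_clear_errors();

if ($xml === false) {
    foreach ($errors as $error) {
        // Each LibXMLError object carries the position and a message.
        printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
    }
}
```

The reported line and column point at where the parser gave up, which is usually at or just after the offending character.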

edited by culix · answered Aug 12 '10 at 08:56 by Dominic Rodger

I tried to isolate the line with the error and pasted its content into a separate XML file, and that parses fine! So I still can't figure out what makes the XML file fail to parse. – user315396 Aug 12 '10 at 09:41
In this case, this exactly reinforces what Dominic is saying. – Jhong Aug 12 '10 at 09:53
OK... I think some of the data is not UTF-8. Actually, if I open the XML in FF, there is an error message about a bad character. IE... hm... it's a big file. I waited a long time but got no response. – user315396 Aug 12 '10 at 11:21
@DominicRodger Thank you! xmllint let me find the invalid characters and remove them. – culix Apr 24 '16 at 04:05

score 1 · Answer 5 · answered Mar 03 '15 at 14:18

My problem was the "&" character (hex 0x26), so I raised the lower bound of the first range from 0x20 to 0x28:

function stripInvalidXml($value)
{
    $ret = "";
    if (empty($value))
    {
        return $ret;
    }

    $length = strlen($value);
    for ($i=0; $i < $length; $i++)
    {
        $current = ord($value[$i]);
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||

            (($current >= 0x28) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            $ret .= chr($current);
        }
        else
        {
            $ret .= " ";
        }
    }
    return $ret;
}
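
If the "&" comes from text that you write into the XML yourself, escaping it keeps the data intact instead of replacing it with a space. A minimal sketch, assuming you control the code that builds the document; the element name `title` and the sample text are made up.

```php
<?php
$text = 'Tom & Jerry <live>';

// ENT_XML1 restricts the output to the entities that XML 1.0 defines.
$escaped = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');

// The escaped text can now be embedded in an element without breaking the parse.
$xml = simplexml_load_string('<title>' . $escaped . '</title>');

// Reading the element back yields the original, unescaped text.
var_dump((string) $xml === $text);
```

This does not help with a document you merely receive; there the stray "&" has to be fixed by whoever produces the XML.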

score 0 · Answer 6 · answered Aug 12 '10 at 10:13

Make sure your XML source is valid. See http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

answered Aug 12 '10 at 10:13 by bcosca

The XML is valid. What if the encoding is UTF-8 but there is a Big5 char? I found the char "". – user315396 Aug 12 '10 at 11:02

score 0 · Answer 7 · answered Aug 24 '21 at 15:51

I used this to clean the string:

public static function Clean($inputName)
    {
        $strName=trim($inputName);
        
        if($strName!="")
        {
            $strName = iconv("UTF-8", "UTF-8//IGNORE", $strName); // drop all non utf-8 characters
            
            $strName=str_replace(array('\\','/',':','*','?','"','<','>','|'),'@',$strName); 
            // Was applied to an undefined $string; the subject should be $strName.
            $strName = preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $strName);
            
            // [\x00-\x1F]  control characters http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx   
            
            // Invalid control chars: [\x00-\x08\x0B\x0C\x0E-\x1F]
            // UTF-16 surrogates: \xED[\xA0-\xBF].
            // Non-characters U+FFFE and U+FFFF: \xEF\xBF[\xBE\xBF]
            // Invalid characters are replaced with the replacement character U+FFFD 

            $strName = preg_replace(
            '/[\x00-\x08\x0B\x0C\x0E-\x1F]|\xED[\xA0-\xBF].|\xEF\xBF[\xBE\xBF]/',
            "\xEF\xBF\xBD",
            $strName);
            
            // Reduce all multiple whitespace to a single space
            // $strName = preg_replace('/\s+/', ' ', $strName); 
            
            if(trim($strName)=="")
            {
                $strName="@" . "empty-name";
            }
        }
        else
        {
            $strName=" ";
        }       
        
        return $strName;
    }

score 0 · Answer 8 · edited May 23 '17 at 12:34

For a non-destructive method of loading this type of input into a SimpleXMLElement, see my answer on How to handle invalid unicode with simplexml

edited by Community · answered Nov 11 '11 at 10:34 by Mike Venzke

score -5 · Answer 9 · answered Jul 30 '15 at 22:59

Not a PHP solution, but it works:

Download Notepad++ https://notepad-plus-plus.org/

Open your .xml file in Notepad++

From Main Menu: Search -> Search Mode set this to: Extended

Then,

Replace -> Find what \x00; Replace with {leave empty}

Then, Replace_All
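
The same replacement can be sketched in PHP, which may be handy when the file is too large to open comfortably in an editor. This assumes the file is UTF-8 or another encoding in which 0x00 is never part of a legitimate character (so not UTF-16); the function name `stripNulBytes` and the temporary files are made up for the example.

```php
<?php
/**
 * Copies $source to $target, dropping every NUL byte on the way.
 * Works in fixed-size chunks so a very large file never has to fit in memory.
 */
function stripNulBytes(string $source, string $target, int $chunkSize = 1048576): void
{
    $in = fopen($source, 'rb');
    $out = fopen($target, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException('Could not open source or target file');
    }
    while (!feof($in)) {
        $chunk = fread($in, $chunkSize);
        if ($chunk === false) {
            break;
        }
        // NUL is a single byte that never occurs inside a multi-byte UTF-8
        // sequence, so a chunk boundary cannot split a match.
        fwrite($out, str_replace("\0", '', $chunk));
    }
    fclose($in);
    fclose($out);
}

// Demonstration with temporary files standing in for the real XML.
$src = tempnam(sys_get_temp_dir(), 'xml');
$dst = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($src, "<a>x" . "\0" . "y</a>");
stripNulBytes($src, $dst);
echo file_get_contents($dst), "\n";
```

Writing to a second file rather than editing in place keeps the original around in case the NUL bytes turn out to be a symptom of a different encoding.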

Rob
