
I have a problem reading certain characters from my XML file in PHP.

The file contains characters like "ä", "ü" and "ö". I get the following error:

simplexml_load_string() [function.simplexml-load-string]: Entity: line 96: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xFC 0x73 0x65 0x0C

hakre
Abid

2 Answers


PHP strings are plain byte sequences; the language has no native Unicode string type. The Unicode support planned for PHP 6 was abandoned, and PHP 7 shipped without it. To bridge the gap, there are several extensions such as mbstring, iconv and intl.

Make sure you send the HTML response with an appropriate Content-Type and charset, e.g.

<?php header('Content-Type: text/html; charset=utf-8');?>

Also check that the XML file prolog contains the proper encoding, e.g.

<?xml version="1.0" encoding="UTF-8"?>

Assuming that is all correct, it appears that the XML file claims to be UTF-8 but is actually something else (likely ISO-8859-1, also known as Latin-1, or Windows-1252; the byte 0xFC from the error message is "ü" in both). You can open the XML file in your favorite editor (I like Sublime) and save it explicitly with UTF-8 encoding. Or you can use a function that attempts to repair the string before loading, like the one from: Error: "Input is not proper UTF-8, indicate encoding !" using PHP's simplexml_load_string

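// Heuristic: re-encode bytes in the range 0xA1-0xFF that are not followed by
// two or more UTF-8 continuation bytes, treating them as stray ISO-8859-1.
function fix_latin1_mangled_with_utf8_maybe_hopefully_most_of_the_time($str)
{
    return preg_replace_callback('#[\\xA1-\\xFF](?![\\x80-\\xBF]{2,})#', 'utf8_encode_callback', $str);
}

// Converts one matched ISO-8859-1 byte to UTF-8.
// utf8_encode() is deprecated as of PHP 8.2; mb_convert_encoding() (mbstring)
// performs the same ISO-8859-1 to UTF-8 conversion.
function utf8_encode_callback($m)
{
    return mb_convert_encoding($m[0], 'UTF-8', 'ISO-8859-1');
}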
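
Before repairing anything, it can help to confirm that the bytes really are invalid UTF-8. The following is a minimal sketch, assuming the mbstring extension is available; the helper name is_valid_utf8 and the sample strings are made up for illustration and are not part of the linked answer.

```php
<?php
// Hypothetical helper: reports whether a byte string is valid UTF-8,
// so you know whether any re-encoding is needed at all.
function is_valid_utf8(string $bytes): bool
{
    // mb_check_encoding() is provided by the mbstring extension.
    return mb_check_encoding($bytes, 'UTF-8');
}

// "München" with "ü" stored as the single ISO-8859-1 byte 0xFC.
$latin1 = "M\xFCnchen";
// The same word with "ü" stored as the UTF-8 byte pair 0xC3 0xBC.
$utf8 = "M\xC3\xBCnchen";

var_dump(is_valid_utf8($latin1)); // bool(false)
var_dump(is_valid_utf8($utf8));   // bool(true)
```

If the check already returns true for the whole file, the parser error probably has a different cause and the heuristic above should not be applied.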

In the end, this is going to be messy: PHP still doesn't handle Unicode as well as we would all like, and Unicode support simply isn't built into the core.

We suggest you check out Portable UTF-8 - a Lightweight Library for Unicode Handling in PHP.

bubba

The XML string you have is not properly encoded. Without an encoding declaration, the parser assumes UTF-8, but your string uses a different encoding, most likely Windows-1252.

To make that error go away, you need to re-encode the string from its current encoding (unknown, because your question does not state it) to UTF-8.

A string whose encoding is unknown cannot be converted reliably, so you first need to find out which encoding the string actually uses.
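
Re-encoding before parsing could look roughly like this. It is a sketch only: the helper name load_xml_as_utf8 and the sample document are hypothetical, the source encoding is an assumption you have to verify for your own file, and the mbstring extension is required. It also assumes the XML has no declaration or declares UTF-8; a declaration naming another encoding would have to be adjusted as well.

```php
<?php
// Hypothetical helper: converts an XML string from a known source encoding
// to UTF-8 and then parses it. $fromEncoding is an assumption to verify.
function load_xml_as_utf8(string $xml, string $fromEncoding)
{
    // Leave the string untouched if it is already valid UTF-8.
    if (!mb_check_encoding($xml, 'UTF-8')) {
        $xml = mb_convert_encoding($xml, 'UTF-8', $fromEncoding);
    }
    // Returns a SimpleXMLElement, or false if parsing still fails.
    return simplexml_load_string($xml);
}

// Sample document with "ü" stored as the single Windows-1252 byte 0xFC.
$raw  = "<root><city>M\xFCnchen</city></root>";
$sxml = load_xml_as_utf8($raw, 'Windows-1252');

echo $sxml->city, "\n"; // the city name, now as UTF-8 bytes
```

The validity check keeps an already correct UTF-8 file from being converted a second time, which would produce mojibake.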

Once you know the encoding, you can either convert the string to UTF-8 or inject the correct encoding into the XML declaration. The latter is easy with XMLRecoder (Inspect and modify character encoding of an XML document based on XML Declaration and BOM). Parts of it are explained in "PHP XMLReader, get the version and encoding"; that question is about XMLReader, but SimpleXML is also a libxml-based PHP XML extension and shares the same behavior, so the approach works here too.

Usage example:

$buffer = file_get_contents($file);

$fromEncoding = 'WINDOWS-1252';  # insert *your* correct string encoding here

$recoder = new XMLRecoder();
$buffer  = $recoder->setEncodingDeclaration($buffer, $fromEncoding);

$sxml = simplexml_load_string($buffer);
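
To illustrate the idea of injecting a declaration without any library, here is a rough stand-in. It is not the XMLRecoder API: the function set_xml_encoding_declaration is hypothetical, it always writes version 1.0, and it ignores byte order marks. The demo uses ISO-8859-1 on the assumption that the installed libxml supports it.

```php
<?php
// Hypothetical stand-in for illustration only; this is NOT XMLRecoder.
// Replaces or prepends an XML declaration naming the given encoding.
function set_xml_encoding_declaration(string $xml, string $encoding): string
{
    $declaration = '<?xml version="1.0" encoding="' . $encoding . '"?>';

    // Swap out an existing declaration at the very start of the string.
    $result = preg_replace('/^<\?xml\b[^>]*\?>/', $declaration, $xml, 1, $count);

    // No declaration found: prepend one instead.
    return $count > 0 ? $result : $declaration . "\n" . $xml;
}

// Sample document with "ü" stored as the single ISO-8859-1 byte 0xFC.
$raw  = "<root><city>M\xFCnchen</city></root>";
$sxml = simplexml_load_string(set_xml_encoding_declaration($raw, 'ISO-8859-1'));

echo $sxml->city, "\n"; // SimpleXML returns the value as UTF-8
```

With this approach the bytes of the document stay unchanged; the parser does the conversion because the declaration now states the real encoding.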

To better understand XML encodings in PHP and the available charset encodings and names, please see:

hakre