5

I have a big XML file (>15 MB) that I have to read and parse, storing some of its values in a DB. My problem is that the XML files arrive in different encodings (UTF-8, ISO-8859-1).

UTF-8 is no problem, but ISO-8859-1 is giving me huge problems. The tag names contain special characters that are not parsed correctly by XMLReader and readOuterXML().

I already tried this, with no luck:

$xml = new XMLReader;
$xml->open($import_file,'ISO-8859-1');  

I also tried:

  • utf8_encode
  • mb_convert_encoding($stringXML, 'UTF-8' );
  • iconv("ISO-8859-1", "UTF-8//TRANSLIT", $stringXML);

The XML (simplified)

  • tag (id) --> no problem
  • tag (baños) --> problem

xml:

<?xml version="1.0" encoding="ISO-8859-1"?>
<data>
    <id><![CDATA[5531]]></id>
    <baños><![CDATA[0]]></baños>
</data>

None of them helped me.

Nacho
  • Show us the actual xml data. Try copy pasting it, and after also a hexdump of the actual characters that you can't parse. That will help determining the issue. – Evert Aug 19 '14 at 01:21
  • @Evert XML uploaded! Thanks – Nacho Aug 19 '14 at 01:28
  • Ok, so here's the important bit! Use `hexdump -C` on the command line (or another hex editor) and find out which byte value(s) are used for the `ñ` character. We want to make sure that it's _actually_ ISO-8859-1 and not something else. – Evert Aug 19 '14 at 01:44
  • @Evert I got this answer: <![CDATA[| 00000710 5d 5d 3e 3c 2f 6e 75 6d 5f 62 61 96 6f 73 3e 0d |]]>. My ñ was replaced by a "." – Nacho Aug 19 '14 at 11:13
  • hexdump automatically changes any non-ASCII characters to `.`. The important part is the first half of that line, the actual byte values. There, your `ñ` shows up as `0x96`. In ISO-8859-1, 0x96 is a C1 control code, and in CP-1252 it is an en dash, so neither maps it to `ñ`. Whatever your encoding is, it's something else! – Evert Aug 19 '14 at 13:46
  • ISO-8859-1 encodes `ñ` as `0xf1`, for what it's worth. – Evert Aug 19 '14 at 13:51
  • @Evert more complicated! Any idea on how to solve it? – Nacho Aug 19 '14 at 14:14
  • It would be good to talk to the people who generate the XML, as they are definitely doing it wrong ;) Try to fix it at the source? – Evert Aug 19 '14 at 14:23
  • @Evert I wish it were that easy, but it's not possible to change that... – Nacho Aug 19 '14 at 14:28
  • Do you know what the correct result should be if it's not "baños"? Perhaps something like "ba-os" ? – Phil Jan 24 '15 at 22:49
  • Or is the issue that you are getting an error while parsing? If so, please include the error in your question. Thanks. – Phil Jan 24 '15 at 22:59
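
The byte check discussed in the comments above can also be done from PHP when no hex dump tool is at hand. The following is only a minimal sketch: `nonAsciiBytes()` is a hypothetical helper name, and the sample string is rebuilt from the bytes quoted in the hex dump (`6e 75 6d 5f 62 61 96 6f 73`).

```php
<?php
// Hypothetical helper: return each distinct non-ASCII byte in $raw as two hex digits.
function nonAsciiBytes(string $raw): array
{
    // No /u modifier: the pattern matches single bytes, not UTF-8 characters.
    preg_match_all('/[\x80-\xFF]/', $raw, $matches);
    return array_values(array_unique(array_map('bin2hex', $matches[0])));
}

// Sample tag name rebuilt from the bytes shown in the hex dump.
$sample = "</num_ba\x96os>";
print_r(nonAsciiBytes($sample));
```

Running this against the real file (for example via `file_get_contents()`) lists the byte values to look up in the code tables of the candidate encodings.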

5 Answers

0

What is your internal encoding in PHP? You can check it with echo mb_internal_encoding();.

If it is UTF-8, then mb_convert_encoding($data, "UTF-8") won't do anything, because the third parameter $from_encoding will be "UTF-8" already.

You have to provide the source encoding as a third parameter to the function.

So maybe this will do the trick:

//check which encoding the data has? 
$encoding = mb_detect_encoding($data);
if($encoding != "UTF-8"){
    //specify from which encoding to convert to utf-8
    $data = mb_convert_encoding($data, "UTF-8", $encoding); 
}
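
Since automatic detection may not identify every legacy encoding, an alternative is to test only whether the bytes are already valid UTF-8 and otherwise convert from an encoding you supply yourself. This is a sketch under assumptions: `toUtf8()` is a hypothetical helper, and the `$fallback` encoding has to come from your own knowledge of the feed (for example from inspecting its bytes), not from the XML declaration.

```php
<?php
// Hypothetical helper: leave valid UTF-8 untouched, otherwise convert
// from a caller-supplied source encoding (an assumption about the data).
function toUtf8(string $data, string $fallback): string
{
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data; // already valid UTF-8
    }
    $converted = iconv($fallback, 'UTF-8', $data);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $fallback to UTF-8");
    }
    return $converted;
}

// "ba\xF1os" is "baños" encoded as ISO-8859-1.
echo bin2hex(toUtf8("ba\xF1os", 'ISO-8859-1')), "\n";
```

Note that a UTF-8 validity check cannot tell two single-byte encodings apart, so the fallback still has to be chosen by a human.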
Max
  • The problem with `mb_detect_encoding` is that it doesn't support MacRoman (the actual encoding of this XML file). The complete list of the supported encodings can be found [here](http://php.net/manual/en/mbstring.supported-encodings.php). – Tip-Sy Jan 27 '15 at 14:34
  • @Tip-Sy, thank you for that hint, I wasn't aware of this. So if it has to be done in pure PHP, it looks like you have to implement this on your own. There seems to be a sample implementation here: http://ctd-web.fr/blog/2011/02/18/php-detection-encodage-mac-roman-utf-8/ – Max Jan 29 '15 at 11:45
0

As @Evert pointed out, the byte code of your ñ is: 0x96, and the encoding of your XML file is in fact MacRoman (see the table here).

If you want to convert your data to UTF-8 format, here is what you need to do:

$stringXML = file_get_contents('yourFile.xml');
$data = iconv('MACINTOSH', 'UTF-8', $stringXML);

Another possibility is to use iconv on the command line:

iconv -f MACINTOSH -t UTF-8 file.xml > outputUTF8.xml

(Here is a link to the lib for Linux: http://www.gnu.org/software/libiconv/)
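
One step that is easy to miss: after the conversion, the XML declaration still claims `ISO-8859-1`, so a parser may decode the now-UTF-8 bytes a second time. The sketch below rewrites the declared encoding before handing the string to XMLReader. It is an illustration under assumptions, not a definitive implementation: `macRomanXmlToUtf8()` is a hypothetical helper, the sample document is shortened from the question, and it assumes your iconv build knows the `MACINTOSH` charset name.

```php
<?php
// Hypothetical helper: convert MacRoman bytes to UTF-8 and make the
// XML declaration agree with the new encoding.
function macRomanXmlToUtf8(string $raw): string
{
    $utf8 = iconv('MACINTOSH', 'UTF-8', $raw);
    if ($utf8 === false) {
        throw new RuntimeException('iconv could not convert the input');
    }
    // Replace only the declared encoding value in the leading <?xml ... ?> line.
    return preg_replace('/^(<\?xml[^>]*encoding=["\'])[^"\']+/', '${1}UTF-8', $utf8, 1);
}

// Sample input: declared as ISO-8859-1, but 0x96 is "ñ" in MacRoman.
$raw = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
     . "<data><ba\x96os><![CDATA[0]]></ba\x96os></data>";

$reader = new XMLReader();
$reader->XML(macRomanXmlToUtf8($raw));
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        echo $reader->name, "\n"; // element names, now as UTF-8
    }
}
```

For a file larger than 15 MB, converting once with the command-line `iconv` and then opening the result with `XMLReader::open()` avoids holding the whole document in memory.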

Tip-Sy
0

I was able to successfully decode the given xml using Symfony's XmlEncoder class (https://github.com/symfony/Serializer). I stored the xml in a test.xml file to guarantee the correct encoding (since my php files are encoded in UTF-8 by default).

$encoder = new Symfony\Component\Serializer\Encoder\XmlEncoder();
$data = $encoder->decode(file_get_contents('test.xml'), 'xml');
//$data = ['id' => 5531, 'baños' => 0]
Eelke van den Bos
0

If there's a problem with special characters in the XML tags, here's a quick and dirty way of cleaning up the tags before parsing:

$xml = <<<END
<?xml version="1.0" encoding="ISO-8859-1"?>
<data>
    <id><![CDATA[5531]]></id>
    <baños><![CDATA[0]]></baños>
</data>
END;

function FilterXML($matches)
{
  return $matches[1] . preg_replace('/[^a-z]/ui', '_', $matches[2]) .
    $matches[3];
}

var_dump(preg_replace_callback('#(</?)([^!?]+?)(\\s|>)#', 'FilterXML', $xml));

It will replace <baños> with <ba_os>.

Thomas Sahlin
-1

You can try to read the XML file first, and then convert the special characters, and then read the XML string using XMLReader.

Here's the code:

<?php
header("Content-Type: text/plain; charset=ISO-8859-1");
function normalizeChars($s){
    $replace = array(
        '&amp;' => 'and', '@' => 'at', '©' => 'c', '®' => 'r', 'À' => 'a',
        'Á' => 'a', 'Â' => 'a', 'Ä' => 'a', 'Å' => 'a', 'Æ' => 'ae','Ç' => 'c',
        'È' => 'e', 'É' => 'e', 'Ë' => 'e', 'Ì' => 'i', 'Í' => 'i', 'Î' => 'i',
        'Ï' => 'i', 'Ò' => 'o', 'Ó' => 'o', 'Ô' => 'o', 'Õ' => 'o', 'Ö' => 'o',
        'Ø' => 'o', 'Ù' => 'u', 'Ú' => 'u', 'Û' => 'u', 'Ü' => 'u', 'Ý' => 'y',
        'ß' => 'ss','à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a', 'å' => 'a',
        'æ' => 'ae','ç' => 'c', 'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i', 'ò' => 'o', 'ó' => 'o',
        'ô' => 'o', 'õ' => 'o', 'ö' => 'o', 'ø' => 'o', 'ù' => 'u', 'ú' => 'u',
        'û' => 'u', 'ü' => 'u', 'ý' => 'y', 'þ' => 'p', 'ÿ' => 'y', 'Ā' => 'a',
        'ā' => 'a', 'Ă' => 'a', 'ă' => 'a', 'Ą' => 'a', 'ą' => 'a', 'Ć' => 'c',
        'ć' => 'c', 'Ĉ' => 'c', 'ĉ' => 'c', 'Ċ' => 'c', 'ċ' => 'c', 'Č' => 'c',
        'č' => 'c', 'Ď' => 'd', 'ď' => 'd', 'Đ' => 'd', 'đ' => 'd', 'Ē' => 'e',
        'ē' => 'e', 'Ĕ' => 'e', 'ĕ' => 'e', 'Ė' => 'e', 'ė' => 'e', 'Ę' => 'e',
        'ę' => 'e', 'Ě' => 'e', 'ě' => 'e', 'Ĝ' => 'g', 'ĝ' => 'g', 'Ğ' => 'g',
        'ğ' => 'g', 'Ġ' => 'g', 'ġ' => 'g', 'Ģ' => 'g', 'ģ' => 'g', 'Ĥ' => 'h',
        'ĥ' => 'h', 'Ħ' => 'h', 'ħ' => 'h', 'Ĩ' => 'i', 'ĩ' => 'i', 'Ī' => 'i',
        'ī' => 'i', 'Ĭ' => 'i', 'ĭ' => 'i', 'Į' => 'i', 'į' => 'i', 'İ' => 'i',
        'ı' => 'i', 'IJ' => 'ij','ij' => 'ij','Ĵ' => 'j', 'ĵ' => 'j', 'Ķ' => 'k',
        'ķ' => 'k', 'ĸ' => 'k', 'Ĺ' => 'l', 'ĺ' => 'l', 'Ļ' => 'l', 'ļ' => 'l',
        'Ľ' => 'l', 'ľ' => 'l', 'Ŀ' => 'l', 'ŀ' => 'l', 'Ł' => 'l', 'ł' => 'l',
        'Ń' => 'n', 'ń' => 'n', 'Ņ' => 'n', 'ņ' => 'n', 'Ň' => 'n', 'ň' => 'n',
        'ʼn' => 'n', 'Ŋ' => 'n', 'ŋ' => 'n', 'Ō' => 'o', 'ō' => 'o', 'Ŏ' => 'o',
        'ŏ' => 'o', 'Ő' => 'o', 'ő' => 'o', 'Œ' => 'oe','œ' => 'oe','Ŕ' => 'r',
        'ŕ' => 'r', 'Ŗ' => 'r', 'ŗ' => 'r', 'Ř' => 'r', 'ř' => 'r', 'Ś' => 's',
        'ś' => 's', 'Ŝ' => 's', 'ŝ' => 's', 'Ş' => 's', 'ş' => 's', 'Š' => 's',
        'š' => 's', 'Ţ' => 't', 'ţ' => 't', 'Ť' => 't', 'ť' => 't', 'Ŧ' => 't',
        'ŧ' => 't', 'Ũ' => 'u', 'ũ' => 'u', 'Ū' => 'u', 'ū' => 'u', 'Ŭ' => 'u',
        'ŭ' => 'u', 'Ů' => 'u', 'ů' => 'u', 'Ű' => 'u', 'ű' => 'u', 'Ų' => 'u',
        'ų' => 'u', 'Ŵ' => 'w', 'ŵ' => 'w', 'Ŷ' => 'y', 'ŷ' => 'y', 'Ÿ' => 'y',
        'Ź' => 'z', 'ź' => 'z', 'Ż' => 'z', 'ż' => 'z', 'Ž' => 'z', 'ž' => 'z',
        'ſ' => 'z', 'Ə' => 'e', 'ƒ' => 'f', 'Ơ' => 'o', 'ơ' => 'o', 'Ư' => 'u',
        'ư' => 'u', 'Ǎ' => 'a', 'ǎ' => 'a', 'Ǐ' => 'i', 'ǐ' => 'i', 'Ǒ' => 'o',
        'ǒ' => 'o', 'Ǔ' => 'u', 'ǔ' => 'u', 'Ǖ' => 'u', 'ǖ' => 'u', 'Ǘ' => 'u',
        'ǘ' => 'u', 'Ǚ' => 'u', 'ǚ' => 'u', 'Ǜ' => 'u', 'ǜ' => 'u', 'Ǻ' => 'a',
        'ǻ' => 'a', 'Ǽ' => 'ae','ǽ' => 'ae','Ǿ' => 'o', 'ǿ' => 'o', 'ə' => 'e',
        'Ё' => 'jo','Є' => 'e', 'І' => 'i', 'Ї' => 'i', 'А' => 'a', 'Б' => 'b',
        'В' => 'v', 'Г' => 'g', 'Д' => 'd', 'Е' => 'e', 'Ж' => 'zh','З' => 'z',
        'И' => 'i', 'Й' => 'j', 'К' => 'k', 'Л' => 'l', 'М' => 'm', 'Н' => 'n',
        'О' => 'o', 'П' => 'p', 'Р' => 'r', 'С' => 's', 'Т' => 't', 'У' => 'u',
        'Ф' => 'f', 'Х' => 'h', 'Ц' => 'c', 'Ч' => 'ch','Ш' => 'sh','Щ' => 'sch',
        'Ъ' => '-', 'Ы' => 'y', 'Ь' => '-', 'Э' => 'je','Ю' => 'ju','Я' => 'ja',
        'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'g', 'д' => 'd', 'е' => 'e',
        'ж' => 'zh','з' => 'z', 'и' => 'i', 'й' => 'j', 'к' => 'k', 'л' => 'l',
        'м' => 'm', 'н' => 'n', 'о' => 'o', 'п' => 'p', 'р' => 'r', 'с' => 's',
        'т' => 't', 'у' => 'u', 'ф' => 'f', 'х' => 'h', 'ц' => 'c', 'ч' => 'ch',
        'ш' => 'sh','щ' => 'sch','ъ' => '-','ы' => 'y', 'ь' => '-', 'э' => 'je',
        'ю' => 'ju','я' => 'ja','ё' => 'jo','є' => 'e', 'і' => 'i', 'ї' => 'i',
        'Ґ' => 'g', 'ґ' => 'g', 'א' => 'a', 'ב' => 'b', 'ג' => 'g', 'ד' => 'd',
        'ה' => 'h', 'ו' => 'v', 'ז' => 'z', 'ח' => 'h', 'ט' => 't', 'י' => 'i',
        'ך' => 'k', 'כ' => 'k', 'ל' => 'l', 'ם' => 'm', 'מ' => 'm', 'ן' => 'n',
        'נ' => 'n', 'ס' => 's', 'ע' => 'e', 'ף' => 'p', 'פ' => 'p', 'ץ' => 'C',
        'צ' => 'c', 'ק' => 'q', 'ר' => 'r', 'ש' => 'w', 'ת' => 't', '™' => 'tm',
        'ñ' => 'n',
    );
    return strtr($s, $replace);
}

$path_to_file = '';
$xml_text = @file_get_contents($path_to_file);
if(!empty($xml_text)){
    $xml_text = normalizeChars($xml_text);
    $xml = new XMLReader();
    $xml->XML($xml_text);
}
?>

On another note, if you're looking for performance, then you should try SimpleXML and DOM Document as mentioned in the following StackOverflow question: https://stackoverflow.com/a/1835324/1337185

EDIT:

I added header("Content-Type: text/plain; charset=ISO-8859-1") because strtr works only with ISO-8859-1. I tried it with the XML string provided by the OP and it's working perfectly. If any character is missing, feel free to add it to the array.

Wissam El-Kik
  • I've done it like this, but it does not replace the characters... `$content = @file_get_contents("www.path.com/file.xml"); $content = $this->normalizeChars($content); echo "REPLACE DONE!\n"; $newxml = simplexml_load_string($content); print_r($newxml);` – Nacho Jan 20 '15 at 17:08
  • I'm glad it worked. Please vote on the answer if you think it solved your problem. – Wissam El-Kik Jan 20 '15 at 20:35
  • Nope... @Wissam El-Kik I wrote "does not replace the characters". It didn't solve my problem. – Nacho Jan 20 '15 at 22:20
  • @user2855036 I just realized that `strtr` needs a charset ISO-8859-1. Now it should work. – Wissam El-Kik Jan 21 '15 at 12:20
  • @user2855036 Where is it mentioned? I can't find it on the web page you provided. That's why I'm adding a note at the bottom of the page to explain this weird behavior. The note isn't online yet. – Wissam El-Kik Jan 21 '15 at 16:33
  • @user2855036 I tried the function with the XML string you provided and it's working perfectly but you need to write `header("Content-Type: text/plain; charset=ISO-8859-1")` at the top of the document. – Wissam El-Kik Jan 21 '15 at 16:33
  • This is a hack that modifies the source before parsing, which normally doesn't solve the underlying encoding issue. You should use iconv rather than trying to reinvent the wheel. Why should 'ת' map to 't' and not 'n'? There are no semantics if you do it this way. – Phil Jan 24 '15 at 22:05
  • I totally agree with you on that point, but the OP tried the `iconv` function and it didn't work (most probably because he didn't set the "in_charset" properly). The OP didn't mention which charset he's using. The array above is the one used in the Magento framework, and I had to manually add `'ñ' => 'n'`. – Wissam El-Kik Jan 25 '15 at 10:11