6

I have a field in MySQL of type text, using the following collation: utf8_general_ci.

This XML field is populated using a variable built using DOMDocument:

function ed_audit_node($dom, $field, $new, $old){

    //create audit_detail node
    $ad = $dom->createElement('audit_detail');

    $fn = $dom->createElement('fieldname');
    $fn->appendChild($dom->createTextNode($field));
    $ad->appendChild($fn);

    $ov = $dom->createElement('old_value');
    $ov->appendChild($dom->createTextNode($old));
    $ad->appendChild($ov);

    $nv = $dom->createElement('new_value');
    $nv->appendChild($dom->createTextNode($new));
    $ad->appendChild($nv);

    //append to document
    return $ad;
}

Here's how I save to the db ( $xml comes from $dom->saveXML() ):

function ed_audit_insert($ed, $xml){
    global $visitor;

    $sql = <<<EOF
    INSERT INTO ed.audit
    (employee_id, audit_date, audit_action, audit_data, user_id) 
    VALUES (
        {$ed[emp][employee_id]}, 
        now(), 
        '{$ed[audit_action]}', 
        '{$xml}', 
        {$visitor[user_id]}
    );      
EOF;
    $req = mysql_query($sql,$ed['db']) or die(db_query_error($sql,mysql_error(),__FUNCTION__));
//snip  
}

See an older, parallel, slightly related thread on how I’m creating this XML: Another PHP XML parsing error: "Input is not proper UTF-8, indicate encoding!"

What works: - querying the database, selecting the field and outputting it using jQuery (.ajax()) and populating a textarea. Firebug and the textarea match what's in the database (confirmed with Toad).

What doesn't work: - outputting the text from the database into an HTML page. This HTML page has the content-type ISO-8859-1, which I cannot change.

Here’s the code that outputs that to the screen:

$xmlData = simplexml_load_string($d['audit_data']);

foreach ($xmlData->audit_detail as $a){
    echo "<p> straight from db = ".$a->new_value."</p>";
    echo "<p> utf8_decode() = ".utf8_decode($a->new_value)."</p>";
} 

I’ve also used a charset changer extension for Firefox: tried ISO-8859-1, UTF-8 and 1252 without success.

If it was UTF-8, shouldn’t I be seeing diamonds with question marks inside (since it's content-type = ISO-8859-1)? If it’s not UTF-8, what is it?

Edit #1

Here's snapshot of other tests that I have made:

$xmlData = simplexml_load_string($d['audit_data']);
foreach ($xmlData->audit_detail as $a){
    echo "<p>encoding is, straight from db, using mb_detect_encoding: ".mb_detect_encoding($a->new_value)."</p>";
    echo "<p>encoding is, with utf8_decode, using mb_detect_encoding: ".mb_detect_encoding(utf8_decode($a->new_value))."</p>";
    echo "<hr/>";
    echo "<p> straight from db = <pre>".$a->new_value."</pre></p>";
    echo "<p> utf8_decode() = <pre>".utf8_decode($a->new_value)."</pre></p>";
    echo "<hr/>";
    $iso88591_2 = iconv('UTF-8', 'ISO-8859-1', $a->new_value);
    $iso88591_3 = mb_convert_encoding($a->new_value, 'ISO-8859-1', 'UTF-8');
    echo "<p> iconv() = ".$iso88591_2."</p>";
    echo "<p> mb_convert_encoding() = ".$iso88591_3."</p>";
}

Edit #2

I added the FF proprietary tag, xmp.

Code:

$xmlData = simplexml_load_string($d['audit_data']);

foreach ($xmlData->audit_detail as $a){
    echo "<p>encoding is, straight from db, using mb_detect_encoding: ".mb_detect_encoding($a->new_value)."</p>";
    echo "<p>encoding is, with utf8_decode, using mb_detect_encoding: ".mb_detect_encoding(utf8_decode($a->new_value))."</p>";
    echo "<hr/>";
    echo "<p> straight from db = <pre>".$a->new_value."</pre></p>";
    echo "<p> utf8_decode() = <pre>".utf8_decode($a->new_value)."</pre></p>";
    echo "<hr/>";
    $iso88591_2 = iconv('UTF-8', 'ISO-8859-1', $a->new_value);
    $iso88591_3 = mb_convert_encoding($a->new_value, 'ISO-8859-1', 'UTF-8');
    echo "<p> iconv() = ".$iso88591_2."</p>";
    echo "<p> mb_convert_encoding() = ".$iso88591_3."</p>";
    echo "<hr/>";
    echo "<p>straight from db, using &lt;xmp&gt;  = <xmp>".$a->new_value."</xmp></p>";
    echo "<p>utf8_decode(), using &lt;xmp&gt; = <xmp>".utf8_decode($a->new_value)."</xmp></p>";

}

Here are some meta tags from the page:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<meta name="dc.language" scheme="ISO639-2/T" content="eng" />

IMO, the last meta tag has no bearing.

Edit #3

Source code:

<p>encoding is, straight from db, using mb_detect_encoding: UTF-8</p><p>encoding is, with utf8_decode, using mb_detect_encoding: ASCII</p><hr/><p> straight from db = <pre>Ro马eç ³é ¥n franê¡©s</pre></p><p> utf8_decode() = <pre>Ro?e??n fran?s</pre></p><hr/><p> iconv() = Ro</p><p> mb_convert_encoding() = Ro?e??n fran?s</p><hr/><p>straight from db, using &lt;xmp&gt;  = <xmp>Ro马eç ³é ¥n franê¡©s</xmp></p><p>utf8_decode(), using &lt;xmp&gt; = <xmp>Ro?e??n fran?s</xmp></p>

Edit #4

Here is the SQL statement going in to the db:

INSERT INTO ed.audit
    (employee_id, audit_date, audit_action, audit_data, user_id) 
    VALUES (
        75, 
        now(), 
        'u', 
        '<?xml version="1.0"?>
<audit><audit_detail><fieldname>role_fra</fieldname><old_value>aRo&#x9A6C;e&#x7833;&#x9825;n fran&#xA869;s</old_value><new_value>bRo&#x9A6C;e&#x7833;&#x9825;n fran&#xA869;s</new_value></audit_detail></audit>
', 
        333
    );

! Note, the text from this XML doesn't necessarily match the screenshots provided above.

Edit #5

Here's my new function that wraps the CDATA tag around my values for the old_value and new_value nodes:

function ed_audit_node($dom, $field, $new, $old){

    //create audit_detail node
    $ad = $dom->createElement('audit_detail');

    $fn = $dom->createElement('fieldname');
    $fn->appendChild($dom->createTextNode($field));
    $ad->appendChild($fn);

    $ov = $dom->createElement('old_value');

    $ov->appendChild($dom->createCDATASection($old));
    $ad->appendChild($ov);

    $nv = $dom->createElement('new_value');
    $nv->appendChild($dom->createCDATASection($new));
    $ad->appendChild($nv);

    //append to document
    return $ad;
}

I also added the encoding to the XML document:

$dom = new DomDocument('1.0', 'UTF-8');

Here's my new simpleXML call:

$xmlData = simplexml_load_string($d['audit_data'], "SimpleXMLElement", LIBXML_NOENT | LIBXML_NOCDATA);

I see the CDATA tags in Toad as well. However, I'm still getting an error:

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 2: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xE9 0xE9 0x6C 0x65 in <snip>

Edit #6

I just noticed that the jQuery call returns the proper accented characters in the CDATA.

Community
  • 1
  • 1
TechFanDan
  • 3,329
  • 6
  • 46
  • 89

1 Answers1

1

Technically your string is in UTF8, but the HTML encoded characters (when rendered by the browser) are not in UTF8. So &#xa869; is a valid UTF8 String, but the character that is rendered onto the screen from the web browser is not valid UTF8.

I would also wrap your echo to the screen (last 2 lines in your example) like this:

echo "<p>straight from db = <xmp>".$a->new_value."</xmp></p>";
echo "<p>utf8_decode() = <xmp>".utf8_decode($a->new_value)."</xmp></p>";

This will clearly display the point i am making above.

Edit:

The problem actually is an uncontrollable undocumented "Feature" in simplexml_load_string() for PHP. It will automatically convert all characters from their XML entity form strait into their actual char form. The only way to circumvent this is to use simplexml_load_string() like this:

 $data = simplexml_load_string(
      '<?xml version="1.0" encoding="utf-8"?> 
           <audit>
                <audit_detail>
                     <fieldname>role_fra</fieldname>
                     <old_value><![CDATA[aRo&#x9A6C;e&#x7833;&#x9825;n fran&#xA869;s]]></old_value>
                     <new_value><![CDATA[bRo&#x9A6C;e&#x7833;&#x9825;n fran&#xA869;s]]></new_value>
                </audit_detail>
           </audit>', 
      "SimpleXMLElement", 
      LIBXML_NOENT | LIBXML_NOCDATA
 );
 print "<PRE>";
 print_r($data);
 exit;

You must wrap your elements in <![CDATA[]]> tags and then pass the LIBXML_NOCDATA option to the xml parser. This will force the things in <![CDATA[]]> tags to be cast as String type and PHP can properly handle that outside of a SimpleXMLObject.

Geoffrey Wagner
  • 818
  • 1
  • 5
  • 11
  • may want to look here: http://stackoverflow.com/questions/374425/convert-utf8-characters-to-iso-88591-and-back-in-php – dqhendricks Jan 13 '11 at 19:02
  • @Geoffrey: even with the pre, the text remains the same as in my example (the font size changes though). I've tried in both IE and FF. @dqhendricks: Thanks, I'll check that out. – TechFanDan Jan 13 '11 at 19:23
  • @tekius You may want to try insteand of <pre> tags. <xmp> will work in FF to display the raw output to you. Does the concept of `ꡣ` being valid UTF8 make sense? Also the `utf8_decode($a-&gt;new_value);` is decoding the decoded information, thus &amp; and # are being turned into a decoded version of themselves, thus all the ???? you see in your output. The <xmp> for raw output should display this. I have updated the code above to reflect the change. – Geoffrey Wagner Jan 13 '11 at 19:27
  • @Geoffrey: I tried your example and I'm not getting the proper output still. I've added a screenshot and code in my OP. Would there be a setting in FF that would be "tinkering" with my output using the xmp tag? – TechFanDan Jan 13 '11 at 19:48
  • @Tekius what OS are you on? Also can you post the strait HTML output of your test page? – Geoffrey Wagner Jan 13 '11 at 19:49
  • @Geoffrey: Win XP SP3 using FF 3.6.13. I've posted a screenshot and the actual HTML output taken from the source viewer (redundant, I know). Would have there been a better way to take the source? I've saved the source to an html file and opened it in notepad which gave the same output. – TechFanDan Jan 13 '11 at 20:04
  • @Tekius The thing that is not matching up for me is, your Toad display and your testing displays are showing different things. If the data was actually entering PHP the same way Toad is displaying it, then you should be seeing HTML encoded characters in FF (and there is no special trigger to make show something different on any OS). Can you post your output of the query before it actually gets inserted into the DB? So a post of the INSERT INTO query directly from the FF source code page. – Geoffrey Wagner Jan 13 '11 at 20:10
  • @Geoffrey: see edit #4 with SQL. Things seem to be going in properly. – TechFanDan Jan 13 '11 at 20:22
  • @Tekius It would appear that the XML is being transformed by tthe simplexml_load_string function, i am working up a fix for you that will not decode the HTML Special Chars. – Geoffrey Wagner Jan 13 '11 at 20:34
  • @Tekius if you wrap your elements in `<![CDATA[]]>` tags and use your simplexml_load_string like this: ` $data = simplexml_load_string(' role_fra<![CDATA[aRo马e砳頥n franꡩs]]><![CDATA[bRo马e砳頥n franꡩs]]>', "SimpleXMLElement", LIBXML_NOENT | LIBXML_NOCDATA);` then everything should work properly. Let me know if this works and i can post it as the answer. – Geoffrey Wagner Jan 13 '11 at 20:47
  • @Geoffrey: I've applied your changes, see edit #5. I'm still getting an error, now when I read in the XML. – TechFanDan Jan 14 '11 at 12:28
  • @Tekius the database is now incorrect. What happened with your dom creator (which i really dont understand why your using XML in PHP and MySQL, JSON would be much better, or creating a table to hold all of your fileds) is it never converted your special chars into their hex format (aka ꡩ) Also, technically the characters you have do not have a strait UTF-8 conversion (because they are not UTF-8 characters) so you will either need to build a conversion table for those characters, or strip out everything that is not UTF-8 from your scripts. – Geoffrey Wagner Jan 14 '11 at 14:22
  • @Tekius Check http://stackoverflow.com/questions/1401317/remove-non-utf8-characters-from-string to see how exactly to identify and remove all of your non-UTF-8 characters. Also i would read this http://en.wikipedia.org/wiki/UTF-8 to have a much better understanding of the UTF-8 range of characters and what characters lay outside of UTF-8 and why you cant just convert them – Geoffrey Wagner Jan 14 '11 at 14:26
  • @Geoffrey: I'll look at JSON, thanks for the tip. Agreed on the fact that this should have a table to hold the fields. – TechFanDan Jan 14 '11 at 15:16
  • @Tekius I just want to make sure you understand this whole character encoding conundrum that we seem to have fallen into here? JSON isnt the end all solution especially if you still start with non-utf-8 characters in your string. – Geoffrey Wagner Jan 14 '11 at 15:18
  • @Geoffrey: I was thinking of removing the utf8 collation from my field and replacing it with the default, storing things as ISO-8859-1 and figured I'd be set. – TechFanDan Jan 14 '11 at 15:35