I've run into an odd issue that I'm fairly sure is an encoding error, but while troubleshooting it PHP is behaving in a way I'm hoping someone can help me make sense of.

I have some XML that is generated via XQuery:

<?xml version="1.0" encoding="UTF-8"?>
<list>
   <item>
      <orig>London, British Library Harley 2251: <ref target="Quis_Dabit/British_Library_Harley_2251/British_Library_Harley_2251_f42v.html">
            <orig xmlns="http://www.tei-c.org/ns/1.0">O alle ye doughtres · of Jerusalem</orig>
         </ref>
      </orig>
   </item>
   <item>
      <orig>London, British Library Harley 2255: <ref target="Quis_Dabit/British_Library_Harley_2255/British_Library_Harley_2255_f67r.html">
            <orig xmlns="http://www.tei-c.org/ns/1.0">
               <hi rend="blue_pilcrow">¶</hi>O alle ye douħtren of <hi rend="underline">ierusaleem</hi>
            </orig>
         </ref>
      </orig>
   </item>
   <item>
      <orig>Long Melford, Holy Trinity Church Clopton Chantry Chapel: <ref target="Quis_Dabit/Clopton/ww_qd_2.html">
            <orig xmlns="http://www.tei-c.org/ns/1.0">
               <hi>O</hi> alle ye <gap quantity="8" unit="chars" reason="illegible"/>s of ierusaleem</orig>
         </ref>
      </orig>
   </item>
   <item>
      <orig>Cambridge, Jesus College Q.G.8: <ref target="Quis_Dabit/Jesus_College_Q_G_8/Jesus_Q_G_8_f20r.html">
            <orig xmlns="http://www.tei-c.org/ns/1.0">
               <hi>A</hi>ll the <hi rend="underline">doughtren </hi>of <hi rend="underline">Ierusalem</hi> .</orig>
         </ref>
      </orig>
   </item>
   <item>
      <orig>Oxford, Bodleian Library Laud 683: <ref target="Quis_Dabit/Laud_683/Laud_683_f78v.html">
            <orig xmlns="http://www.tei-c.org/ns/1.0">O alle ẏe douhtren of jerusaleem</orig>
         </ref>
      </orig>
   </item>
   <item>
      <orig>Oxford, St. John's College 56: <ref target="Quis_Dabit/St_John_56/St_John_56_73v.html">
            <orig xmlns="http://www.tei-c.org/ns/1.0">O alle the doughtren / of Jerusalem ؛</orig>
         </ref>
      </orig>
   </item>
</list>

I then load it into PHP:

// shell_exec() returns the command's full output; exec() returns only its last line
$text = shell_exec("java -cp saxon9he.jar net.sf.saxon.Query -t -q:test.xq");

$xml = new DOMDocument;
$xml->loadXML($text);

$xsl = new DOMDocument;
$xsl->load('comparison.xsl');

// Configure the transformer
$proc = new XSLTProcessor;
$proc->importStyleSheet($xsl); // attach the xsl rules

echo $proc->transformToXML($xml);

and apply this XSLT stylesheet (comparison.xsl) to it:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="1.0">
<xsl:output method="html" encoding="UTF-8"/>
<xsl:template match="list">
    <div class="comparison">
        <ul>
            <xsl:apply-templates/>
        </ul>
    </div>
</xsl:template>
<xsl:template match="item">
    <li>
        <xsl:apply-templates/>
    </li>
</xsl:template>
</xsl:stylesheet>

However, when I do so, non-ASCII characters in the resulting output come out garbled, as seen here:

[Screenshot: how the output displays in Safari]

My assumption was that this is an encoding issue with the results, so I added a print_r statement to show me both the raw XML and the DOM tree, then refreshed.

[Screenshot: the odd behavior with the print_r statement added]

I don't doubt it's an encoding error, and I plan on tracking it down, but what I want to know is why the output displays correctly when I add a print_r statement and incorrectly when I don't. Is there something I should add to the PHP file that I haven't? Thanks!

medievalmatt

1 Answer

Your XSLT outputs HTML encoded as UTF-8, but nothing in your PHP declares a matching encoding, so the output is likely being interpreted using the system default encoding, which is probably CP1252 or ISO-8859-1 on Windows and Mac OS Roman on macOS.

The reason you get somewhat readable output with print_r is that it tries its best to take the UTF-8 string and print it using the default encoding.

To see this effect, force the encoding in your browser by clicking View > Encoding > Unicode (the exact location of this menu differs per browser). After manually switching to Unicode, you should see the proper text.
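The misreading can also be reproduced outside the browser. The following is only a minimal sketch, assuming PHP 7+ with the mbstring extension; the sample string is made up for illustration and is not taken from your actual output. It reinterprets the UTF-8 bytes of a pilcrow as ISO-8859-1, which is roughly what a browser does when it guesses the wrong encoding:

```php
<?php
// Sample text (illustrative only): a pilcrow followed by ASCII.
// In UTF-8 the pilcrow U+00B6 is the two bytes C2 B6.
$utf8 = "\u{00B6}O alle ye";

// Show the raw bytes of the pilcrow.
$hex = bin2hex("\u{00B6}");
echo $hex, "\n"; // c2b6

// Treat each byte as an ISO-8859-1 character and re-encode as UTF-8,
// so the misread text can be displayed on a UTF-8 terminal or page.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo $misread, "\n"; // Â¶O alle ye
```

Each multi-byte UTF-8 character turns into two or more Latin-1 characters, which matches the stray "Â"-style characters typically seen in this kind of garbled output.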

The next step is to fix the output encoding of your PHP script. First and foremost, it must tell the browser that your page is UTF-8 and not ISO-8859-1. This post explains how to set the output encoding with PHP. The second answer there may also be needed to force PHP to use UTF-8 for any and all output statements.
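As a rough sketch of what that fix might look like (not necessarily the exact approach from the linked post), the script can send a Content-Type header before any output and set PHP's default_charset. The inline XML and stylesheet below are small stand-ins for the Saxon output and comparison.xsl, chosen so the example runs on its own; it assumes the dom and xsl extensions are enabled:

```php
<?php
// Declare UTF-8 to the browser. Headers must be sent before any body output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
// Also make UTF-8 the charset PHP assumes by default.
ini_set('default_charset', 'UTF-8');

// Stand-in for the XQuery result (hypothetical sample data).
$source = '<?xml version="1.0" encoding="UTF-8"?>'
        . "<list><item>O alle \u{1E8F}e douhtren \u{00B7} of jerusaleem</item></list>";

// Stand-in for comparison.xsl, reduced to two templates.
$stylesheet = <<<'XSL'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="html" encoding="UTF-8"/>
  <xsl:template match="list"><ul><xsl:apply-templates/></ul></xsl:template>
  <xsl:template match="item"><li><xsl:apply-templates/></li></xsl:template>
</xsl:stylesheet>
XSL;

$xml = new DOMDocument;
$xml->loadXML($source);

$xsl = new DOMDocument;
$xsl->loadXML($stylesheet);

$proc = new XSLTProcessor;
$proc->importStylesheet($xsl);

// Transform and emit the HTML fragment.
$html = $proc->transformToXML($xml);
echo $html;
```

In your own script, the header() call would go at the very top, before the print_r or echo statements, since PHP cannot change headers once output has started.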

Abel