
I have an XML file as follows:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="test.xslt"?>
<results>
    <test name="sentence1">
        <description href="#ömr">
            ömr1, ämr1, ümr1 and pär1
        </description>
    </test>
    <test name="sentence2" href="#pär2">
        <description>
            ömr2, ämr2, ümr2 and pär2
        </description>
    </test>
    <test name="sentence3" href="#pär3">
        <description>
            ömr3, ämr3, ümr3 and pär3
        </description>
    </test>
</results>

Here is the XSLT:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"     xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:b="http://www.froglogic.com/XML2" 
xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<xsl:output method="html" version="5.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="Summary/test">
    <html>
        <body>
            <xsl:for-each select="//test">
                <xsl:variable name="linkMe" select="@name"/>
                    <xsl:value-of select="description"/>
                    <a href="#{$linkMe}" >
                        <xsl:value-of select="$linkMe" />
                    </a>                        
                <xsl:value-of select="description"/>
            </xsl:for-each>
        </body>
    </html>
</xsl:template>

I want to convert the XML to an HTML file using Perl, but the output is not what I expect, even though I have told Perl that I want UTF-8 output.

The Perl code is:

use strict;
use warnings;

use XML::LibXML;
use XML::Writer;
use XML::LibXSLT;
use XML::Parser;

use Encode qw( is_utf8 encode decode );

my $XML_File  = "test2.xml";
my $XSLT_File = "test2.xslt";
my $HTML_File = "test2.html";

sub XML2HTML {

    my $xml_parser  = XML::LibXML->new('1.0', 'UTF-8');
    my $xslt_parser = XML::LibXSLT->new('1.0', 'UTF-8');
    my $xml         = $xml_parser->parse_file($XML_File);

    $xml->setEncoding('UTF-8');

    my $xsl         = $xml_parser->parse_file($XSLT_File);
    my $stylesheet  = $xslt_parser->parse_stylesheet($xsl);
    my $results     = $stylesheet->transform($xml);
    my $output      = $stylesheet->output_string($results);

    $stylesheet->output_file($results, $HTML_File);
  }

  &XML2HTML($XML_File, $XSLT_File, $HTML_File);

Another question: how can I write the output file as UTF-8 with a BOM? I searched the internet and could not find an exact answer; everything I found mentions UTF-8 rather than UTF-8 with BOM.

The HTML output looks wrong:

ömr1, ämr1, ümr1 and pär1 ömr2, ämr2, ümr2 and pär2 ömr3, ämr3, ümr3 and pär3 

The encoding reported for the HTML file is

Codepage 1252 (Western)

which is strange!

Borodin
Royeh

1 Answer


First, you have a subroutine which operates on global variables. That is not a good idea. Instead, pass those values as arguments to the function so that it is not tied to names you use in other places in your program.

Second, you do not do anything with $output, but storing the output in it will still increase the memory footprint of your program.

Third, looking at the underlying XS code for output_file, we see:

xsltSaveResultToFilename(filename, doc, self, 0);

xsltSaveResultToFilename is documented in the libxslt API reference. Its source code shows that the routine deduces the output encoding from the stylesheet. So, the problem has to lie elsewhere.

It turns out my initial diagnosis was incorrect. After getting my hands on a system with the necessary libraries, I ran your script (which revealed syntax errors in your XSL file; don't post code we cannot run). After fixing those, I realized the code was producing UTF-8 encoded output, but the HTML did not include a declaration of the document encoding. Therefore, when I viewed it in my browser, the browser tried to use Windows-1252. Your XSL template needs to declare the encoding of the HTML document as well. Of course, if you add the BOM, you probably don't need the declaration in the head of the document.
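To make the symptom concrete, here is a minimal sketch of what happens when UTF-8 bytes are interpreted as Windows-1252. It uses only the core Encode module; the sample string comes from the question's XML, and the variable names are purely illustrative.

```perl
use strict;
use warnings;
use utf8;                       # this source contains literal non-ASCII text
use Encode qw(encode decode);

# Text as it appears in the XML file.
my $text = "ömr1, ämr1, ümr1 and pär1";

# The bytes that a UTF-8 encoded HTML file actually contains.
my $bytes = encode('UTF-8', $text);

# What a viewer shows if it guesses Windows-1252 for those same bytes.
my $misread = decode('cp1252', $bytes);

binmode STDOUT, ':encoding(UTF-8)';
print "$misread\n";
```

Each umlaut occupies two bytes in UTF-8, so the misread string shows two characters in its place (for example "Ã¶" instead of "ö"). The file itself is fine; only the guess about its encoding is wrong.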

The following script seems to work for me:

use strict;
use warnings;

use autouse Carp => 'croak';

use File::BOM ();
use XML::LibXML;
use XML::LibXSLT;

xml_to_html('test.xml', 'test.xsl', 'test.html');

sub xml_to_html {
    my ($xml_file, $xsl_file, $html_file) = @_;

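    # The :unix layer gives an unbuffered handle, so the BOM reaches the
    # file before libxslt writes the document to the same handle.
    open my $out, '>:unix', $html_file
        or croak "Failed to open '$html_file': $!";

    # Write the UTF-8 BOM bytes first.
    print $out $File::BOM::enc2bom{'UTF-8'}
        or croak "Failed to write UTF-8 BOM: $!";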

    my $xslt_parser = XML::LibXSLT->new;
    my $xml_parser  = XML::LibXML->new;

    my $xml = $xml_parser->parse_file( $xml_file );
    my $xsl = $xml_parser->parse_file( $xsl_file );
    my $style = $xslt_parser->parse_stylesheet( $xsl );
    my $results = $style->transform( $xml );

    $style->output_fh( $results, $out );
    return;
}

with this template:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:b="http://www.froglogic.com/XML2"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

<xsl:output method="html" version="5.0" encoding="UTF-8" indent="yes"/>

<xsl:template match="/">
    <html>
        <head>
            <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>,
        </head>
        <body>
            <xsl:for-each select="//test">
                <xsl:variable name="linkMe" select="@name"/>
                    <xsl:value-of select="description"/>
                    <a href="#{$linkMe}" >
                        <xsl:value-of select="$linkMe" />
                    </a>                        
                <xsl:value-of select="description"/>
            </xsl:for-each>
        </body>
    </html>
</xsl:template>
</xsl:stylesheet>

and produces the following output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html xmlns:b="http://www.froglogic.com/XML2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">,
        </head>
<body>
            ömr1, ämr1, ümr1 and pär1
        <a href="#sentence1">sentence1</a>
            ömr1, ämr1, ümr1 and pär1

            ömr2, ämr2, ümr2 and pär2
        <a href="#sentence2">sentence2</a>
            ömr2, ämr2, ümr2 and pär2

            ömr3, ämr3, ümr3 and pär3
        <a href="#sentence3">sentence3</a>
            ömr3, ämr3, ümr3 and pär3
        </body>
</html>

My installed libxslt is:

$ pacman -Ss libxslt
extra/libxslt 1.1.29+42+gac341cbd-1 [installed]
     XML stylesheet transformation library

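This version does not seem to support generating an HTML5 doctype, which is why the output above carries an HTML 4.0 Transitional doctype despite version="5.0" in the stylesheet.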
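If the HTML5 doctype matters, one possible workaround is to post-process the serialized result in Perl before writing it out. The sketch below is an assumption on my part, not something libxslt provides: force_html5_doctype is a hypothetical helper, and the sample string stands in for whatever the stylesheet's output_string method returns.

```perl
use strict;
use warnings;

# Hypothetical helper: swap the first doctype declaration for the HTML5 one.
sub force_html5_doctype {
    my ($html) = @_;
    # Only the first <!DOCTYPE ...> is replaced; everything else is untouched.
    $html =~ s/<!DOCTYPE\s[^>]*>/<!DOCTYPE html>/i;
    return $html;
}

# Stand-in for the serialized transformation result.
my $libxslt_output = <<'HTML';
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body></body>
</html>
HTML

print force_html5_doctype($libxslt_output);
```

The substitution only touches ASCII characters, so it is safe to apply to the UTF-8 encoded bytes as well as to a decoded string.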

Depending on your specific needs, you may have to tweak the XSLT file further.
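The script above takes the BOM bytes from File::BOM. If installing that module is not an option, the same three bytes can be written by hand. This is a minimal sketch using only core modules; write_utf8_with_bom and has_utf8_bom are hypothetical helper names, and the temporary file is just for demonstration.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Hypothetical helper: write $text to $path as UTF-8, preceded by a BOM.
sub write_utf8_with_bom {
    my ($path, $text) = @_;
    open my $out, '>:raw', $path or die "Failed to open '$path': $!";
    # U+FEFF encoded as UTF-8 is the three bytes EF BB BF.
    print {$out} "\xEF\xBB\xBF", encode('UTF-8', $text)
        or die "Failed to write '$path': $!";
    close $out or die "Failed to close '$path': $!";
    return;
}

# Hypothetical helper: true if the file starts with the UTF-8 BOM.
sub has_utf8_bom {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "Failed to open '$path': $!";
    read $in, my $head, 3;
    close $in;
    return defined $head && $head eq "\xEF\xBB\xBF";
}

my (undef, $path) = tempfile(UNLINK => 1);
write_utf8_with_bom($path, "ömr1, ämr1, ümr1 and pär1\n");
print has_utf8_bom($path) ? "BOM present\n" : "no BOM\n";
```

The has_utf8_bom check is also a quick way to confirm that the HTML file produced by the script really starts with a BOM.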

Sinan Ünür