How do I validate an XML document using XML::LibXML when the DTD is available over HTTPS?

Test code:

#!/usr/bin/perl -w

use XML::LibXML;

use strict;

my $xml = XML::LibXML->load_xml(IO => \*DATA);
my $dtd = XML::LibXML::Dtd->new( "-//NLM//DTD LinkOut 1.0//EN", "https://www.ncbi.nlm.nih.gov/projects/linkout/doc/LinkOut.dtd" );
my $https_is_valid = $xml->is_valid( $dtd );
print "HTTPS dtd: ", ref $dtd, "\n Is valid: $https_is_valid\n";

my $dtd_http = XML::LibXML::Dtd->new( "-//NLM//DTD LinkOut 1.0//EN", "http://www.ncbi.nlm.nih.gov/projects/linkout/doc/LinkOut.dtd" );
my $http_is_valid = $xml->is_valid( $dtd_http );
print "HTTP dtd: ", ref $dtd_http, "\n Is valid: $http_is_valid\n";

__DATA__
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE LinkSet PUBLIC "-//NLM//DTD LinkOut 1.0//EN" "https://www.ncbi.nlm.nih.gov/projects/linkout/doc/LinkOut.dtd" [
<!ENTITY base.url "https://some.domain.com">
<!ENTITY icon.url "https://some.domain.com/logo.png">
]>
<LinkSet>
  <Link>
    <LinkId>1</LinkId>
    <ProviderId>XXXX</ProviderId>
    <IconUrl>&icon.url;</IconUrl>
    <ObjectSelector>
      <Database>PubMed</Database>
      <ObjectList>
        <ObjId>1234567890</ObjId>
      </ObjectList>
    </ObjectSelector>
    <ObjectUrl>
      <Base>&base.url;</Base>
      <Rule>/1/</Rule>
    </ObjectUrl>
  </Link>
</LinkSet>

The code above produces the following output:

HTTPS dtd:
  Is valid: 0
HTTP dtd: XML::LibXML::Dtd
  Is valid: 1

The DTD fails to load from the HTTPS URL, and therefore cannot be used to validate the XML.

I've downloaded the DTD over HTTPS and checked for HTTP redirects - there aren't any.

I've also looked at XML::LibXML::InputCallback, but I can't see how to combine it with XML::LibXML::Dtd->new( ... ).

How should I implement this validation?

The DTD is available over HTTP so I could just use that to validate, but this feels like I'm avoiding the problem rather than solving it properly!

  • I can reproduce. But note that you can simplify the example: The issue is that `Dtd->new(…)` doesn't seem to work with HTTPS. That validation fails is then a consequence of that, and adds no further information. – amon Feb 12 '18 at 16:56
  • You could always download the DTD yourself (e.g. using LWP) and use `->parse_string($downloaded_dtd)` instead of `->new($url)`. – ikegami Feb 12 '18 at 17:02
  • Seems to be a central problem in libxml2. See: [Bug 791220 - xmllint and https support](https://bugzilla.gnome.org/show_bug.cgi?id=791220), [gentoo/dotnet: libxml2 doesn't support https](https://github.com/gentoo/dotnet/issues/178), or [How to validate metadata.xml against .dtd in gentoo?](https://stackoverflow.com/q/35530009/1521179) (Ignore Gentoo, this is about the libxml2 library.) – amon Feb 12 '18 at 17:16
  • @amon - I'd seen the libxml2 issues before I started looking at `XML::LibXML::InputCallback` - but forgot to mention it. @ikegami - if there are external references in the DTD, would this break? – jesusbagpuss Feb 12 '18 at 17:33
  • Ideally your validation code shouldn't be downloading the DTD every time anyway. There's a thing called [XML Catalog](https://en.wikipedia.org/wiki/XML_Catalog) that provides a standard way for applications requesting a DTD to map URLs to a local file instead. For example on a Debian system the facility is provided by the `xml-core` package and it's configured via files in /etc/xml. – Grant McLean Feb 12 '18 at 20:56
  • Thanks @GrantMcLean (something else for me to get my head around sometime soon). Does this help in this situation - e.g. does the XML Catalog get queried by libxml2 before it baulks on having an HTTPS URL to deal with? – jesusbagpuss Feb 15 '18 at 21:50
  • Yes, according to the [libxml docs](http://xmlsoft.org/catalog.html) the system catalog will be checked before making a network request. So if the catalog has a mapping the file will be retrieved directly and no network request will happen. – Grant McLean Feb 16 '18 at 01:58
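
For reference, the suggestion in the comments to download the DTD yourself and hand it to `parse_string` could look roughly like the sketch below. It uses HTTP::Tiny rather than LWP, and the helper names `fetch_text` and `is_valid_against` are made up for illustration. One caveat, stated as an assumption rather than a tested fact: `parse_string` is given no base URL, so a DTD that pulls in other files via relative references may not resolve them.

```perl
use strict;
use warnings;
use XML::LibXML;
use HTTP::Tiny;

# Hypothetical helper: download a resource ourselves instead of letting
# libxml2 do it. HTTP::Tiny needs IO::Socket::SSL and Net::SSLeay
# installed to handle https:// URLs.
sub fetch_text {
    my ($url) = @_;
    my $resp = HTTP::Tiny->new->get($url);
    die "$url: $resp->{status} $resp->{reason}\n" unless $resp->{success};
    return $resp->{content};
}

# Hypothetical helper: validate XML text against DTD text that is
# already in memory. Returns 1 if the document is valid, 0 otherwise.
sub is_valid_against {
    my ($xml_text, $dtd_text) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml_text);
    my $dtd = XML::LibXML::Dtd->parse_string($dtd_text);
    return $doc->is_valid($dtd) ? 1 : 0;
}
```

With these in place, something like `is_valid_against($xml_text, fetch_text("https://www.ncbi.nlm.nih.gov/projects/linkout/doc/LinkOut.dtd"))` would do the download in Perl and leave only the validation to libxml2.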

1 Answer

Note that the XML already contains the URL to the DTD, so you don't need to create an XML::LibXML::Dtd to pass to ->is_valid.

I agree with commenter Grant McLean that you might not want to go out on the network all the time. In fact, a while back I wrote some code that used an XML::LibXML::InputCallback to redirect all network requests to the local FS where I had cached network resources.

But to answer your question, it wasn't too difficult to adapt that code to fetch from the network, including HTTPS, via HTTP::Tiny, which needs IO::Socket::SSL >=1.56 and Net::SSLeay >=1.49 installed for SSL support. The following prints the expected "Is valid: yes":

use warnings;
use strict;
use XML::LibXML;
use HTTP::Tiny;
use URI;

my $parser = XML::LibXML->new;
my $cb = XML::LibXML::InputCallback->new;
my $http = HTTP::Tiny->new;
my %cache;
$cb->register_callbacks([
    sub { 1 }, # match (URI), returns Bool
    sub { # open (URI), returns Handle
        my $uri = URI->new($_[0]);
        my $file;
        #warn "Handling <<$uri>>\n"; #Debug
        if (!$uri->scheme) { $file = $_[0] }
        elsif ($uri->scheme eq 'file') { $file = $uri->path }
        elsif ($uri->scheme=~/\Ahttps?\z/i) {
            if (!defined $cache{$uri}) {
                my $resp = $http->get($uri);
                die "$uri: $resp->{status} $resp->{reason}\n"
                    unless $resp->{success};
                $cache{$uri} = $resp->{content};
            }
            $file = \$cache{$uri};
        }
        else { die "unsupported URL scheme: ".$uri->scheme }
        open my $fh, '<', $file or die "$file: $!";
        return $fh;
    },
    sub { # read (Handle,Length), returns Data
        my ($fh,$len) = @_;
        read($fh, my $buf, $len);
        return $buf;
    },
    sub { close shift } # close (Handle)
]);
$parser->input_callbacks($cb);

my $doc = $parser->load_xml( IO => \*DATA );
print "Is valid: ", $doc->is_valid ? "yes" : "no", "\n";

__DATA__
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE LinkSet PUBLIC "-//NLM//DTD LinkOut 1.0//EN" "https://www.ncbi.nlm.nih.gov/projects/linkout/doc/LinkOut.dtd" [
<!ENTITY base.url "https://some.domain.com">
<!ENTITY icon.url "https://some.domain.com/logo.png">
]>
<LinkSet>
  <Link>
    <LinkId>1</LinkId>
    <ProviderId>XXXX</ProviderId>
    <IconUrl>&icon.url;</IconUrl>
    <ObjectSelector>
      <Database>PubMed</Database>
      <ObjectList>
        <ObjId>1234567890</ObjId>
      </ObjectList>
    </ObjectSelector>
    <ObjectUrl>
      <Base>&base.url;</Base>
      <Rule>/1/</Rule>
    </ObjectUrl>
  </Link>
</LinkSet>
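
The local-filesystem redirect mentioned earlier can be sketched along the same lines. This is only an illustration, not the original code: the function name `parser_with_local_copies` and the URL-to-path map it takes are assumptions. The match callback claims only URLs that have a local copy, so anything else falls through to libxml2's default handling. `load_ext_dtd` is set explicitly because its default may differ between XML::LibXML releases.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: build a parser that serves selected URLs from
# local files. %map is URL => local path, supplied by the caller.
sub parser_with_local_copies {
    my (%map) = @_;
    my $cb = XML::LibXML::InputCallback->new;
    $cb->register_callbacks([
        # match (URI): claim only the URLs we hold a local copy of
        sub { exists $map{ $_[0] } ? 1 : 0 },
        # open (URI): return a read handle on the local copy
        sub {
            my $path = $map{ $_[0] };
            open my $fh, '<', $path or die "$path: $!";
            return $fh;
        },
        # read (Handle, Length): return up to Length bytes
        sub {
            my ($fh, $len) = @_;
            read($fh, my $buf, $len);
            return $buf;
        },
        # close (Handle)
        sub { close shift },
    ]);
    # Ask for the external DTD subset to be loaded while parsing.
    my $parser = XML::LibXML->new(load_ext_dtd => 1);
    $parser->input_callbacks($cb);
    return $parser;
}
```

A call such as `parser_with_local_copies($dtd_url => "/path/to/LinkOut.dtd")`, where the path is a placeholder, returns a parser that can be used with `load_xml` and `is_valid` exactly as above, without touching the network for that URL.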
haukex
  • I think this fits the bill for what I need. I don't want to rely on other things (XML Catalogue) being on the server, so this gives the best solution. – jesusbagpuss Feb 15 '18 at 21:43