Since libxml 2.9, loading external entities has been disabled by default when parsing XML, to prevent XXE attacks.
As a result, to load a DTD file when parsing XML with PHP's DOMDocument, the LIBXML_DTDLOAD option
must be specified.
What would be a good way to verify that only the expected DTD will be loaded, before enabling LIBXML_DTDLOAD?
One approach I can think of (as shown in the example code below) would be to keep entity loading disabled, parse the XML file once, check that the DOCTYPE declaration is as expected, then parse the XML again with entity loading enabled. Would that be sufficient?
<?php
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd">
<article/>
XML;
// entity loading disabled (note: libxml_disable_entity_loader() is deprecated as of PHP 8.0)
libxml_disable_entity_loader(true);
$doc = new DOMDocument;
$doc->loadXML($xml, LIBXML_DTDLOAD); // PHP Warning: DOMDocument::loadXML(): I/O warning : failed to load external entity
print $doc->doctype->systemId; // http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd
// entity loading enabled
libxml_disable_entity_loader(false);
$doc = new DOMDocument;
$doc->loadXML($xml, LIBXML_DTDLOAD);
print $doc->doctype->systemId; // http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd
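
To make the "check first, then load" idea concrete, here is a rough sketch of how the two passes might look. The function names hasExpectedDoctype() and loadWithExpectedDtd() are hypothetical, not part of any library. The first pass omits LIBXML_DTDLOAD and passes LIBXML_NONET, so the external DTD should not be fetched while the DOCTYPE is inspected. Rejecting any internal subset is an extra precaution I am assuming is wanted (declarations there could point at other external resources); it is not part of the approach described above, and I am not claiming the whole thing is sufficient against XXE.

```php
<?php
// Hypothetical helper: true only if the DOCTYPE carries exactly the expected
// public and system identifiers and declares no internal subset.
function hasExpectedDoctype(string $xml, string $expectedPublicId, string $expectedSystemId): bool
{
    $doc = new DOMDocument;

    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    // First pass: no LIBXML_DTDLOAD, and LIBXML_NONET disables network access.
    $ok = $doc->loadXML($xml, LIBXML_NONET);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$ok || $doc->doctype === null) {
        return false;
    }

    $doctype = $doc->doctype;

    return $doctype->publicId === $expectedPublicId
        && $doctype->systemId === $expectedSystemId
        // Assumption: an internal subset is never expected, so reject it.
        && ($doctype->internalSubset ?? '') === '';
}

// Hypothetical wrapper: second pass with DTD loading enabled, only after the check passed.
function loadWithExpectedDtd(string $xml, string $publicId, string $systemId): ?DOMDocument
{
    if (!hasExpectedDoctype($xml, $publicId, $systemId)) {
        return null;
    }

    $doc = new DOMDocument;

    return $doc->loadXML($xml, LIBXML_DTDLOAD) ? $doc : null;
}
```

With the $xml from the question, the call would be loadWithExpectedDtd($xml, '-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN', 'http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd'), which returns null whenever the DOCTYPE differs from the expected one.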