2

Since libxml 2.9, loading external entities has been disabled when parsing XML, to prevent XXE attacks.

In that case, to be able to load a DTD file when parsing the XML with PHP's DOMDocument, LIBXML_DTDLOAD must be specified.

What would be a good way to verify that only the expected DTD will be loaded, before enabling LIBXML_DTDLOAD?

One approach I can think of (as shown in the example code below) would be to keep entity loading disabled, parse the XML file once, check that the DOCTYPE declaration is as expected, then parse the XML again with entity loading enabled. Would that be sufficient?

<?php

$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd">
<article/>
XML;

// entity loading disabled

libxml_disable_entity_loader();

$doc = new DOMDocument;
$doc->loadXML($xml, LIBXML_DTDLOAD); // PHP Warning:  DOMDocument::load(): I/O warning : failed to load external entity

print $doc->doctype->systemId; // http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd

// entity loading enabled

libxml_disable_entity_loader(false);

$doc = new DOMDocument;
$doc->loadXML($xml, LIBXML_DTDLOAD);

print $doc->doctype->systemId; // http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd
Alf Eaton
  • 5,226
  • 4
  • 45
  • 50
  • Sufficient for what? Could you please make this a concrete programming question? -- Possible duplicate of: [Clarifications on XXE vulnerabilities throughout PHP versions](http://stackoverflow.com/q/24117700/367456) (answer pending) – hakre Jul 06 '14 at 18:52
  • The question, as stated, is "What would be a good way to verify that _only_ the expected DTD will be loaded, before enabling LIBXML_DTDLOAD?" – Alf Eaton Jul 07 '14 at 10:02
  • (which implies): Are those DTDs actually loaded if you use `LIBXML_DTDLOAD`? Have you tested? Also please provide example data and code in your question. We need an example here for reproduction and clarity - at least if you want a sufficient and clear answer. According to my tests in the linked question, those aren't loaded regardless of your setting. But I'm not sure about the stability of that test. – hakre Jul 07 '14 at 10:39
  • I've added example code to the question. – Alf Eaton Jul 07 '14 at 18:35
  • +1 for the snappy example :) – hakre Jul 08 '14 at 12:20

2 Answers2

2

What would be a good way to verify that only the expected DTD will be loaded, before enabling LIBXML_DTDLOAD?

If you want to filter (whitelist) expected DTDs, you can do so by disabling all others by returning NULL from your own callable that has been set as external entity loader via libxml_set_external_entity_loader.

That is, you would use the LIBXML_DTDLOAD flag and then resolve to a resource handle in your function in case the DTD is white-listed. In case not, you return said NULL.

<?php
/**
 * @link http://stackoverflow.com/q/24526493/367456
 */

$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd">
<article/>
XML;

/* own entity loader */
libxml_set_external_entity_loader(function() {
  var_dump(func_get_args()); // just for demonstrating purposes
  return NULL;
});

$doc = new DOMDocument;
$doc->loadXML($xml, LIBXML_DTDLOAD);

echo "----\n";

/* restore default entity loader */    
libxml_set_external_entity_loader(NULL);

$doc = new DOMDocument;
$doc->loadXML($xml, LIBXML_DTDLOAD);

Example output:

array(3) {
  [0]=>
  string(66) "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN"
  [1]=>
  string(66) "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd"
  [2]=>
  array(4) {
    ["directory"]=>
    string(1) "/"
    ["intSubName"]=>
    string(7) "article"
    ["extSubURI"]=>
    string(66) "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd"
    ["extSubSystem"]=>
    string(66) "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN"
  }
}

Warning: DOMDocument::loadXML(): Failed to load external entity "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" in Entity, line: 2 in /in/jemmH on line 18
----

Warning: DOMDocument::loadXML(): php_network_getaddresses: getaddrinfo failed: Name or service not known in /in/jemmH on line 25

Warning: DOMDocument::loadXML(http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd): failed to open stream: php_network_getaddresses: getaddrinfo failed: Name or service not known in /in/jemmH on line 25

Notice: DOMDocument::loadXML(): failed to load external entity "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd" in Entity, line: 2 in /in/jemmH on line 25
hakre
  • 193,403
  • 52
  • 435
  • 836
  • Thanks, that seems like a good approach. Do you know how to restore the default external entity loader after the XML has been parsed? – Alf Eaton Jul 08 '14 at 07:20
  • 1
    There is no need to restore the external entit loader normally, as you have alread taken care of the loading, as you mimicked the default behavior already for the whitelisted DTDs. However if you really want to restore the default handler, call `libxml_set_external_entity_loader(NULL);`. An example is here: http://3v4l.org/jemmH – hakre Jul 08 '14 at 08:23
-1

Your approach is seems good but in order to improve performence you may want to do White Listing on DOCTYPE declarations, if it pass then you can parse it with entity loading enabled.

Mehmet Ince
  • 1,298
  • 9
  • 20