
Using PHP to illustrate: there seems to be a bug in the normalizeDocument() method, or a missing "refresh" method, because DOM consistency is lost after changes (even attribute-only changes). So any algorithm with DOM changes that you implement on top of libxml2 sometimes works and sometimes does not: it is unpredictable (?).

The "refresh" via $doc->loadXML($doc->saveXML()); is a workaround, and it costs performance in a workflow built on the DOM. A sub-question: do I need to refresh the DOM at every step?

  $XML = '
  <html>
    <h1>Hello</h1>
    <ol>
        <li>test (no id)</li>
        <li xml:id="i2">test i2</li>
    </ol>
  </html>
  ';
  $doc = new DOMDocument;
  $doc->loadXML($XML);
  doSomeChange($doc);    // the DOM is modified here
  print $doc->saveXML(); // show the new DOM state

  $doc->normalizeDocument(); // does NOT refresh!?
  var_dump($doc->getElementById('i2'));  // NULL!? This is the bug.
  // cannot call doMORESomeChange($doc) at this point

  $doc->loadXML($doc->saveXML());        // the only way to refresh?
  print $doc->getElementById('i2')->tagName;  // OK, it is there

  // illustrating attribute modification:
  function doSomeChange(&$dom) {
    $max = 0;
    $xp  = new DOMXpath($dom);
    foreach(iterator_to_array($xp->query('/html/* | //li')) as $e) {
        $max++;
        $e->setAttribute('xml:id',"i$max");
    }
    print "\ncmpDOM='".($xp->document === $dom)."'\n"; // after @ThomasWeinert
  }

So the input is $XML and the output is:

  <html>
            <h1 xml:id="i1">Hello</h1>
            <ol xml:id="i2">
                <li xml:id="i3">test (no id)</li>
                <li xml:id="i4">test i2</li>
            </ol>
        </html>
  NULL
  ol

The NULL is the bug (see the code comments).

PS: if I change the input line <li xml:id="i2">test i2</li> to <li>test i2</li>, the algorithm works as expected (!), so the behaviour is unpredictable.


Related questions: "In DomDocument, reuse of DOMXpath, it is stable?" and "PHP DomDocument, reuse of XSLTProcessor, it is stable/secure?"

Peter Krauss
  • It may be that `normalizeDocument` doesn't re-parse xml:id attributes - has this problem occurred with any other DOM manipulation? This specific example can be solved by using `loadHTML($xml)` and `id` (rather than `xml:id`) attributes, without using `normalizeDocument`. – Alf Eaton Nov 21 '13 at 10:20
  • Thanks for your notes. But why would it not re-parse xml:id attributes? Yes, the problem occurs with any other modification (removeChild, replaceChild, etc.): the DOM behaviour becomes unpredictable. I use procedures made of a sequence of many "doMORESomeChange()" functions... My concrete problem is processing [XML JATS documents](http://jats.nlm.nih.gov/), which are big, so it is difficult to copy/paste a test environment here... The `xml:id` was used correctly and is a good example; I do not have another XML this short at hand to show more illustrations. – Peter Krauss Nov 21 '13 at 12:58
  • Despite [the documentation](http://php.net/manual/en/domdocument.normalizedocument.php) saying "This method acts as if you saved and then loaded the document", `normalizeDocument` is really only meant for [collapsing adjacent text nodes](http://php.net/manual/en/domnode.normalize.php#56058). – Alf Eaton Nov 21 '13 at 14:30
  • ... Thanks (!). Hmm, yes, so there is no internal method to "refresh the DOM": only the brute force of saveXML/loadXML (plus the need to reconfigure the DOM properties). – Peter Krauss Nov 21 '13 at 15:08

1 Answer


Changes are applied to the DOM the moment you make them. In your example this creates a state where two elements have the same xml:id, and this seems to corrupt the ID index. Remove the xml:id attributes before setting them and it works:

  $XML = '
  <html>
    <h1>Hello</h1>
    <ol>
        <li>test (no id)</li>
        <li xml:id="i2">test i2</li>
    </ol>
  </html>
  ';
  $doc = new DOMDocument;
  $doc->loadXML($XML);
  var_dump($doc->getElementById('i2'), $doc->getElementById('i2')->tagName);
  /*
    object(DOMElement)#2 (0) { }
    string(2) "li"
  */

  doSomeChange($doc);    // here DOM is modified

  var_dump($doc->getElementById('i2'), $doc->getElementById('i2')->tagName);
  /*
    object(DOMElement)#6 (0) { }
    string(2) "ol"
  */

  print $doc->saveXML(); // show new DOM state
  /*
  <?xml version="1.0"?>
  <html>
    <h1 xml:id="i1">Hello</h1>
    <ol xml:id="i2">
      <li xml:id="i3">test (no id)</li>
      <li xml:id="i4">test i2</li>
    </ol>
  </html>
  */

  // illustrating xml:id attribute modification:
  function doSomeChange($dom) {
    $xp  = new DOMXpath($dom);
    foreach($xp->evaluate('//*') as $e) {
      $e->removeAttribute('xml:id');
    }
    $max = 0;
    foreach($xp->evaluate('/html/*|//li') as $e) {
      $max++;
      $e->setAttribute('xml:id',"i$max");
    }
  }

It is your specific DOM modification that breaks the getElementById() calls.
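The collision can also be looked for before an xml:id is assigned. What follows is only a sketch: `xmlIdHolders()` is a hypothetical helper, not part of PHP's DOM extension. It deliberately walks the attributes with XPath rather than calling getElementById(), so that it does not rely on the ID index being discussed here.

```php
<?php
/**
 * Hypothetical helper: returns the elements, other than $target, that
 * already carry the given xml:id value. It reads the attributes through
 * XPath instead of getElementById(), so it does not depend on the ID index.
 */
function xmlIdHolders(DOMElement $target, string $id): array
{
    $xp = new DOMXPath($target->ownerDocument);
    $holders = [];
    // name() returns the qualified attribute name, here "xml:id"
    foreach ($xp->query('//@*[name() = "xml:id"]') as $attr) {
        if ($attr->value === $id && !$attr->ownerElement->isSameNode($target)) {
            $holders[] = $attr->ownerElement;
        }
    }
    return $holders;
}

$doc = new DOMDocument;
$doc->loadXML('<html><h1>Hello</h1><ol><li>test (no id)</li><li xml:id="i2">test i2</li></ol></html>');
$ol = $doc->getElementsByTagName('ol')->item(0);

// Before assigning "i2" to <ol>, list the elements that already hold it.
foreach (xmlIdHolders($ol, 'i2') as $holder) {
    print "xml:id 'i2' is already on <{$holder->tagName}>\n";
}
```

If the list is not empty, the value can be removed from the old holder first (as in doSomeChange() above) before it is set on the new element.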

On the "stability" question: the connection between DOMXpath and DOMDocument is not completely "stable". If you call a load*() method on the DOMDocument, the connection is lost. You can check that the DOMXpath uses the correct DOMDocument by comparing its document property:

var_dump($xpath->document === $doc);

This does not happen in your case, because you always create a new instance of DOMXpath in the function. But it means you should avoid reloading the document, because this will break the DOMXpath instances created for it.
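If a reload cannot be avoided, one way to keep the two objects in step is to do the reload and the DOMXpath creation in the same place. This is a minimal sketch, assuming the saveXML()/loadXML() round trip from the question is acceptable; `refreshDocument()` is a hypothetical name, not an existing method.

```php
<?php
/**
 * Hypothetical helper: brute-force "refresh" by serializing and re-parsing
 * the document, then returning a new DOMXPath bound to the reloaded tree.
 */
function refreshDocument(DOMDocument $doc): DOMXPath
{
    $doc->loadXML($doc->saveXML());
    return new DOMXPath($doc);
}

$doc = new DOMDocument;
$doc->loadXML('<html><ol><li>a</li><li xml:id="i2">b</li></ol></html>');
$xpath = new DOMXPath($doc);

// ... DOM modifications would go here ...

// Replace the old DOMXPath instead of reusing it after the reload.
$xpath = refreshDocument($doc);
print $xpath->query('//li')->length . "\n";
```

Node variables fetched before the reload should be fetched again as well, since they were obtained from the tree that was replaced.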

ThW
  • Good clues, thanks! I edited the code in the question (see the line with your username): your clue about "how to check whether a refresh is needed" does not work; it results in `cmpDOM=1` even when the bug occurs. – Peter Krauss Nov 21 '13 at 14:23
  • This answer is correct. If you add `$doc->getElementById('i2')->removeAttribute('xml:id');` before calling `doSomeChange($doc)`, it works as expected (and there's no need to use `$doc->normalizeDocument()`). – Alf Eaton Nov 21 '13 at 14:35
  • Yes, I agree that your solution with a prior `removeAttribute` works, but to me it is an artificial workaround, because it does not cover all the contexts and cases in which I must decide whether to use another valid (perhaps the only) solution, which is to "refresh the DOM". What condition or "trigger" can I use to check whether the DOM was "corrupted" (unpredictable)? – Peter Krauss Nov 21 '13 at 14:45
  • The problem is that you try to assign an xml:id value to an element while that value is still assigned to another element. This breaks it! You misunderstood my comment about DOMXpath: if you "refresh" (save and load) the document, the connection to the DOMXpath instance is broken. You need to create a new DOMXpath instance for the document in this case. – ThW Nov 21 '13 at 14:47
  • You say that it was only my "loop problem" (assigning a value that is already assigned to another element), and perhaps that is the only problem there. I did not see it because saveXML shows all the xml:id's and the DOM does not reflect the XML string shown. OK. But I have similar instabilities in a big library processing big documents... I need some safe and generic "danger-checker"... So that is my concrete problem: the need for a generic check for "DOM is corrupted" or "index corrupted", with a warning or error message. – Peter Krauss Nov 21 '13 at 14:52
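
On the "danger-checker" asked for in the last comment: one possible approach, offered only as a sketch, is to compare what getElementById() returns with what XPath finds for every xml:id in the document. `findStaleXmlIds()` is a hypothetical helper, and it assumes that a stale index shows up as getElementById() returning NULL or a different element, as in the question.

```php
<?php
/**
 * Hypothetical "danger-checker": returns the xml:id values for which
 * getElementById() does not return the element that XPath finds.
 */
function findStaleXmlIds(DOMDocument $doc): array
{
    $xp = new DOMXPath($doc);
    $stale = [];
    foreach ($xp->query('//@*[name() = "xml:id"]') as $attr) {
        $found = $doc->getElementById($attr->value);
        if ($found === null || !$found->isSameNode($attr->ownerElement)) {
            $stale[] = $attr->value;
        }
    }
    return $stale;
}

$doc = new DOMDocument;
$doc->loadXML('<html><ol><li>a</li><li xml:id="i2">b</li></ol></html>');

// A non-empty result would be the signal to warn, or to save and reload.
if (findStaleXmlIds($doc) !== []) {
    trigger_error('xml:id index does not match the tree', E_USER_WARNING);
}
```

The check walks every xml:id attribute, so on big documents it is probably better run at a few checkpoints than after every change. A value shared by two elements is also reported, because getElementById() can return only one of them.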