0

I need only to compare the same DomDocument object before and after some operation, a "fast check" if the operation changed the object... I need some serialize_DomDocument() (exists something like?) that can be used in the following context:

  1. $obj is a DomDocumemt object (the "state of the object" is the value of all properties and all documentElement contents).

  2. $dump = serialize_DomDocument($obj)   How to do it?? How to dump the state of the object?   ... Not translating all to XML by saveXML() method, but only (FASTER!) coping all binary representarion (of the object pointed by $obj) to $dump.

  3. perform some operation (ex. remove a node or change an attribute or "do nothing")

  4. if ($dump != serialize_DomDocument($obj)) or something like it, checking if $obj changed. Fast comparison. The operation was "do nothing" or "do something"?

Alternative solution...

Not the ideal, but solve some cases... There are some operations (ex. only appendChild's or only removeChild's) where the change always affect the number of nodes, so, for that kind of operations, to check total number of nodes or total length, is enough.

How to check faster than saveXML()?


NOTES

NOTE-1: as remembered by this post, you can not serialize a DomDocument Object.

NOTE-2: is not a problem of comparing two distinct DOM objects, but something more easy, because not change IDs, etc. not need canonical representation (!), only access to internal representation.


EDIT AFTER BOUNTY

This question is not about "how controll changes or a change-flag", please read with attention the question.

This question is not a request of theoretical reviews.

EDIT2

Perhaps the solution is about "low level"... I not understand if the "binary representation" is the "dump" or have another name, see:

Community
  • 1
  • 1
Peter Krauss
  • 13,174
  • 24
  • 167
  • 304

4 Answers4

2

Here a "second class" solution, hoping "first class" here.

... in the absence of a direct solution, there are "Alternative solutions", as I expressed in the question's text... And it is waht I am doing today in my application:

with a XPath count(//node()) we can check if changed, faster than saveXML(), but if the DOM have same number of nodes (ex. change attribute values), we need to check by saveXML, spending memory and/or CPU time to generate and to compare XMLs.

// $old_dom is the original document, 
$old_xp = new DOMXpath($old_dom);
// ... holding $old_n (and perhaps $old_xml) in a cache 
$old_n = $old_xp->evaluate('count(//node())');
$old_xml = $old_dom->saveXML();
...

// $new_dom is the modified document, must tested at each "black box" change.
$new_xp = new DOMXpath($new_dom);
$new_n = $new_xp->evaluate('count(//node())');
if ($new_n!=$old_n) {  //  OK, fast!!
  //.. OK! do something because changed!
} // else, check with more detail if changed,
elseif ($new_dom->saveXML()!=$old_xml){ // memory and CPU-time consuming here!
  //.. OK! do something because changed!
}

As @ArtisiticPhoenix suggested, let's compare!

BENCHMARKING by this simple gist and real life XML-documents of PubMed Central:

  $xp = new DOMXpath($dom);      
  $test = ( $xp->evaluate('count(//node())') == 123 );
    // Execution total time of 10000 loops: 0.582 seconds
    // Averaged execution time of each loop: 5.8152985572815E-5 seconds
  
  $xp = new DOMXpath($dom);      
  $test = ( $xp->evaluate('count(//*)') == 123 );
    // Execution total time of 10000 loops: 0.462 seconds
    // Averaged execution time of each loop: 4.6241903305054E-5 seconds

~10x better when comparing with another solutions,

  $test = ( md5( $dom->saveXML() ) == 'ff1afedc7127eb221050fa48eee5153a');
    // Execution total time of 10000 loops: 3.877 seconds
    // Averaged execution time of each loop: 0.00038769061565399 seconds

  $test = ( $dom->saveXML() == $oldXml ); // comparing long strings 
    // Execution total time of 10000 loops: 3.168 seconds
    // Averaged execution time of each loop: 0.00031677310466766 seconds

  $test = ( $dom->C14N() == $oldXml ); // comparing long strings 
    // Execution total time of 10000 loops: 8.874 seconds
    // Averaged execution time of each loop: 0.00088736770153046 seconds

NOTE: the direct longString comparation is faster than shortString by md5. Confirmed with,

  $test = ( $dom->saveXML() == 'xxx' );
    //Execution total time of 10000 loops: 3.14 seconds
    //Averaged execution time of each loop: 0.00031396999359131 seconds

The long/short change, even in the wrost case (comparing same string), not changes the "0.31 microseconds" to perform direct, better than short+MD5's "0.39 microseconds".


What I looking for in this kind of "alternative solution" ...

... is a DomDocument internal property that returns the "DOM length" or "DOM number of nodes", without need of a XPath counting procedure.

PS: a bounty for "alternative solution" is for this kind of answer.

Community
  • 1
  • 1
Peter Krauss
  • 13,174
  • 24
  • 167
  • 304
1

First off I haven't used DOMDocument for much, but I will give it a go ( and please read the full post ).

You can use the C14N() method it's not well documented but from my IDE i get this:

/**
 * Canonicalize nodes to a string
 * @link http://www.php.net/manual/en/domnode.c14n.php
 * @param exclusive bool[optional] <p>
 * Enable exclusive parsing of only the nodes matched by the provided
 * xpath or namespace prefixes.
 * </p>
 * @param with_comments bool[optional] <p>
 * Retain comments in output.
 * </p>
 * @param xpath array[optional] <p>
 * An array of xpaths to filter the nodes by.
 * </p>
 * @param ns_prefixes array[optional] <p>
 * An array of namespace prefixes to filter the nodes by.
 * </p>
 * @return string canonicalized nodes as a string&return.falseforfailure;
*/
public function C14N ($exclusive = null, $with_comments = null, array $xpath = null, array $ns_prefixes = null) {}

I will simply be taking the example of DOMDocument off of the PHP documentation page for this post.

So that said for my examples I have this obj to start with. ( note where I put the for loop in comments, i'll use this latter for benchmarking ):

    $xml = new \DOMDocument( "1.0", "ISO-8859-15" );
    $xml_album = $xml->createElement( "Album" );
    //---- for( $i=0; $i < 10000; $i++ ){  //for benchmarks I will be adding 30,000 nodes, to get something worth measuring performance on.
    // Create some elements.
    $xml_track = $xml->createElement( "Track", "The ninth symphony" );

    // Set the attributes.
    $xml_track->setAttribute( "length", "0:01:15" );
    $xml_track->setAttribute( "bitrate", "64kb/s" );
    $xml_track->setAttribute( "channels", "2" );

    // Create another element, just to show you can add any (realistic to computer) number of sublevels.
    $xml_note = $xml->createElement( "Note", "The last symphony composed by Ludwig van Beethoven." );

    // Append the whole bunch.
    $xml_track->appendChild( $xml_note );
    $xml_album->appendChild( $xml_track );

    // Repeat the above with some different values..
    $xml_track = $xml->createElement( "Track", "Highway Blues" );

    $xml_track->setAttribute( "length", "0:01:33" );
    $xml_track->setAttribute( "bitrate", "64kb/s" );
    $xml_track->setAttribute( "channels", "2" );
    $xml_album->appendChild( $xml_track );

    $xml->appendChild( $xml_album );
    //----- } //end for loop
    // Parse the XML.
    print $xml->saveXML();

Or roughly when we encode it with htmlspecialchars, and a bit of tabbing:

<?xml version="1.0" encoding="ISO-8859-15"?>
<Album>
    <Track length="0:01:15" bitrate="64kb/s" channels="2">The ninth symphony
        <Note>The last symphony composed by Ludwig van Beethoven.</Note>
    </Track>
    <Track length="0:01:33" bitrate="64kb/s" channels="2">Highway Blues</Track>
</Album>

Good so far now using the ( poorly documented C14N() ) gives us this ( minus the nice indenting, and such ), notice they are almost the same but the order is different and we are minus the encoding bit, so we will not want to compare them against each other:

<Album>
    <Track bitrate="64kb/s" channels="2" length="0:01:15">The ninth symphony
        <Note>The last symphony composed by Ludwig van Beethoven.</Note>
    </Track>
    <Track bitrate="64kb/s" channels="2" length="0:01:33">Highway Blues</Track>
</Album>

Now generally this appears to be similar to just the saveXML, but it has a few more options for filtering the output than just saveXML, so I thought I would mention it.

Now I'm not entirely sure why the concern for performance as in my limited testing I took the liberty of looping it 10,000 times for 30,000 nodes ( 20,000 track, 10,000 note nodes and 60,000 attributes ), and even then performance was fairly good giving me these results ( just for the function calls shown below, not generating the DOM contents, as that is a separate concern):

 $xml->saveXML();
'elapsedTime' => '0.10 seconds',
'elapsedMemory' => '0.39 KB'

 $xml->C14N();
'elapsedTime' => '0.15 seconds',
'elapsedMemory' => '0.3 KB'

  /// outputting to the screen should not be tracked - as I show below this will have a slight, but non-zero impact on the performance benchmarks.
  echo $xml->saveXML()
 'elapsedTime' => '0.16 seconds', //+0.06 seconds
 'elapsedMemory' => '0.3 KB'

  echo $xml->C14N();
 'elapsedTime' => '0.21 seconds', //+0.06 seconds again
 'elapsedMemory' => '0.3 KB'

So the performance is slightly less then that of the saveXML, but in both cases I would say for the number of nodes I'm using is very reasonable.

So given we can acceptably use either saveXML or C14N, how can we compare changes to such a large string? We'll as everyone should know, you hash it. Now immediately one will think of md5 but sha1 is actually better here, it gives us a slightly longer hash and the performance difference is negligible. In both cases hashing adds about 1 hundredth of a second, and gives us something easier to look at when comparing, saving in DB etc.

-- as a side note I love hashing, it's like epoxy glue, or duct tape it just works on everything.

So we simply hash that, save it to a variable and compare it all we want:

 print md5( $xml->saveXML() );

'19edc177072416b7bbf88ea0a240be73'
'elapsedTime' => '0.11 seconds',
'elapsedMemory' => '0.39 KB'


 print sha1( $xml->saveXML() );

'7c644c6e1630ffde15eee64643779e415a1746b7'
'elapsedTime' => '0.11 seconds',
'elapsedMemory' => '0.3 KB'

Now I will probably get knocked for using saveXML() ( and/or C14N() ) but ultimately what it boils down to is this. Even counting the attributes which can be done this way ( just to cover my bases):

 $old_xp = new \DOMXpath($xml);
 $old_a = $old_xp->evaluate('count( //@* )');
 $old_n = $old_xp->evaluate('count( //node() )');
 print 'Attributes: '.$old_a.'<br>';
 print 'Nodes: '.$old_n.'<br>';
 print 'Total: '.($old_a + $old_n).'<br>';

outputs: / 1 iterations ( check against the xml posted above ):

 Attributes: 6
 Nodes: 7  //expected 4 nodes
 Total: 13

outputs: / 10,000 iterations:

'elapsedTime' => '0.02 seconds',
'elapsedMemory' => '0.5 KB'
 Attributes: 60000
 Nodes: 60001  //expected 30,001 nodes ( +2 tracks, +1 note, node per loop and one album node )?
 Total: 120001

As you see time is faster, but because we are instantiating DOMXpath here, which might not affect you if you already have an instance available, memory consumption is almost double.

-- as a side note it appears that $old_xp->evaluate('count( //node() )') is giving strange counts for the nodes I expected 4 nodes and got 7, like it counts both the open and close tags and the encoding tag, or counts the nesting for each child node ( checked this by adding a note node on the second track, which had none, and the count indeed went up by 2 ) any more information on this would be helpful.

Anyway you know the rest for this method.

However, when using the counts, if you were to remove 1 attribute and add another it will incorrectly count the attributes, the same applies for the nodes ( baring it's weird counting ).

But, ultimately there is no way to know if it changed, without looking at the actual data, what if the contents of a node change? etc...

And that ( just counting ) may be good enough for your needs

The choice is yours and it really depends on what level of detail you need, and how much of a performance loss your willing to take for that level of detail.

I suggest thoroughly benchmarking each step and then deciding which is more acceptable for your needs.

Lastly just generating the XML gave me the following time/memory usage ( remember saving and hashing was only 0.11 seconds ):

'elapsedTime' => '21.16 seconds',
'elapsedMemory' => '0.61 KB'

We talk about performance here but no numbers were given, so we really need to put things into context when we make decisions based on performance.

Thanks,

ArtisticPhoenix
  • 21,464
  • 2
  • 24
  • 38
  • I see that you are in a correct way: coding and benchmarking (!). About C14N, see I write here at NOTE-2, "...not need canonical representation (!)...", and how I classify black-boxes [in this other post](http://stackoverflow.com/q/24940429/287948). About performance of C14N, see [the explosion with big node-numbers](http://willem.stuursma.name/2014/03/05/fast-c14n-xml-canonicalization-of-big-xml-documents-in-php/). – Peter Krauss Jul 25 '14 at 11:24
  • About your good performance analysis, and sugestions, in a context of an "[Alternative solution](http://stackoverflow.com/a/24933248/287948)": change from number of nodes, to `sha1`, is a good issue, thanks! I must test and do my home-work now, to verify your suggestions. – Peter Krauss Jul 25 '14 at 11:28
  • @ About performance of C14N - it may be worse the deeper the nesting I was only nesting 3 levels deep at most, and I could see how it could exponentially increase the deeper the node tree. – ArtisticPhoenix Jul 25 '14 at 12:48
  • Ops, about your use of `$old_a + $old_n`, XPath considere attributes and elements as a nodes (!), not make sense to sum. The only moment when DOM model considere an attribute collection (not the isolated attribute) as "item of the tree" is in a replace operation, see [this DOM-tree modeling consideration](http://stackoverflow.com/a/24902417/287948). – Peter Krauss Jul 25 '14 at 18:13
  • I was voted to your answer, by your effort on the right path; but you not get the bounty because **your results seems wrong**. I do my home-work today, and the result ([see here](http://stackoverflow.com/a/24933248/287948)) is that XPath is ~10x better, and the string-direct comparison ~25% faster than MD5's. PS: to get the 50% of the bounty you need more 1 vote. – Peter Krauss Jul 27 '14 at 13:34
0

There are three possible solutions to this general problem:

1) Save a copy (or a serialized copy) as you have suggested. This solution can be accomplished completely externally (assuming cloning or serializing exists). However as you have mentioned, it can be extremely slow for complex objects.

2) Provide the service (change detection) as a feature of the DomDocument (or DomNode) class. This solution requires you to have access to the definition of DomDocument, so you probably don't want to consider it. However, it is the "correct" place to put the service for efficiency since the class itself knows best how to determine if any action will produce a change in the DOM.

3) Provide the service (change recording) as a feature of step 3. above. That is make the code "perform some operation" keep track of whether or not a change was made. For some operations this may require keeping a local copy (or serialized copy) of a small localized part of the DOM. Or, it may require checking the current state of some node(s) before performing the operation in order to determine whether or not a change occurred. However, for most operations the answer should be fairly obvious.

Dwayne Towell
  • 8,154
  • 4
  • 36
  • 49
0

Hm, I looked at the Object structure of DOMDocument. Looks like there is no easy way to do this. My first solution would be to compare the results of the method saveXML. But this is not what you want. Let's make it faster then:

Variant A:

Use the internal serialize php function: http://php.net/manual/de/function.serialize.php. This would also compare the whole document but in serialized form. Could be a little faster than comparing saveXML results.

Variant B:

The first parameter of the method saveXML is interesting. If you know which nodes could have been changed you can reduce your output to only these (see: http://php.net/manual/de/domdocument.savexml.php). This way you don't have to compare the whole document.

Variant C:

If the DOMDocument object can't determine changes on it's own let's teach it to do so! This would look like this (haven't tested it for syntax but it's the idea which is important):

use DOMDocument;
use DOMNode;

class AlterationSensitiveDOMDocument extends DOMDocument
{
    /**
     * @var bool
     *   Determines whether the DOMDocument was altered or not.
     */
    protected $isAltered = false;

    /**
     * @param DOMNode $newnode
     *
     * @return DOMNode|void
     */
    public function appendChild(DOMNode $newnode)
    {
        $this->isAltered = true;

        return parent::appendChild($newnode);
    }

    // More overrides (The ones you need)

    /**
     * @return boolean
     */
    public function isAltered()
    {
        return $this->isAltered;
    }
}

By using the new AlterationSensitiveDOMDocument instead of DOMDocument you will have full access to the DOMDocument parameters and methods. PLUS some of the altering methods (the ones you override) will set the state "isAltered" to true, so you can see wheter the DOMDocument object was altered or not.

This solution smells like a hack but should do the trick.

  • Sorry, please read with attention, I have noticed all... About your Variant-A) see my *NOTE-1: you can not serialize a DomDocument Object*; about your Variant-B) I say *Not translating all to XML by saveXML() method, but only (FASTER!) coping all binary representarion*; About your Variant-C) see my new reinforce *This question is not about "how controll changes or a change-flag"*. – Peter Krauss Jul 23 '14 at 17:14
  • Do you know something about binary representation functions, with [libxml-tree dumps](http://xmlsoft.org/html/libxml-tree.html#xmlBufNodeDump)? (see EDIT2). – Peter Krauss Jul 23 '14 at 19:30
  • Read variant B again. where did I say you have to translate all to XML? –  Jul 24 '14 at 10:01
  • My question is about the "whole document", to check if any byte changed... The only "Alternative solution" (no so good), as I expressed, is to check number of nodes, and is waht I am doing today: with a XPath `count(//node())` I can check if changed faster than `saveXML()`, but if the DOM have same number of nodes (ex. change attribute values) I need to check saveXML, spending CPU time to generate and to compare DOMs. – Peter Krauss Jul 24 '14 at 11:44
  • PS: sorry about your effort with your `AlterationSensitiveDOMDocument` class, I have a *"black box problem"* (ex. executing a XSLT I can not say if changed or not), so I say in the question that something like your variant-C is not valid. – Peter Krauss Jul 24 '14 at 12:13