
I need to access the internal binary representation of a loaded XML DOM. There are some dump functions, but I do not see anything like a "binary buffer" (there are only "XML buffers").

My ultimate goal is to compare, byte by byte, the same document before and after some black-box procedure, directly using its binary representations (current and cached), without conversion to an XML text representation. So, the question:

Is there a binary representation (in-memory structures) in LibXML2 that I can use to compare a dump with the current representation?

I only need to check whether the current and dumped DOMs are equivalent.


Details

This is not a problem of comparing two distinct DOM objects but something easier, because IDs etc. do not change. I do not need a canonical representation (!), only access to the internal representation, because that is much faster than converting to text.

Between "before" and "after" there is a black-box procedure, e.g. an XSLT identity transform that affects (or not) some nodes or attributes.

Alternative solutions...

  1. ... Develop a C function for LibXML2 that compares the two trees node by node and returns false if they differ: during the tree traversal, if the tree structure changes or some nodeValue changes, the algorithm stops the comparison (returning false).

  2. ... Not ideal, but it helps other algorithms: access (in LibXML2) the total number of nodes, or the total length or size, or an MD5 or SHA-1 digest... only to optimize the frequent cases (in my application) where the comparison will return false, avoiding the complete comparison procedure.
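The two alternatives can be sketched roughly as follows. This is only an illustration, not LibXML2 code: `Node` is a hypothetical stand-in that keeps a handful of fields named after those of LibXML2's `xmlNode` (`type`, `name`, `content`, `children`, `next`), and `tree_equal` and `tree_count` are made-up helper names. Attributes and namespaces are ignored here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for LibXML2's xmlNode, keeping only the fields
 * this sketch reads. Attributes, namespaces and the parent/doc pointers
 * of the real struct are deliberately left out. */
typedef struct Node {
    int type;               /* node kind (element, text, ...) */
    const char *name;       /* node name, may be NULL */
    const char *content;    /* text content, may be NULL */
    struct Node *children;  /* first child, or NULL */
    struct Node *next;      /* next sibling, or NULL */
} Node;

/* NULL-safe string equality. */
static int str_equal(const char *a, const char *b) {
    if (a == NULL || b == NULL) return a == b;
    return strcmp(a, b) == 0;
}

/* Alternative 1: returns 1 if the sibling lists starting at a and b have
 * the same structure, names and content; returns 0 at the first
 * difference found, without visiting the rest of the tree. */
int tree_equal(const Node *a, const Node *b) {
    while (a != NULL && b != NULL) {
        if (a->type != b->type) return 0;
        if (!str_equal(a->name, b->name)) return 0;
        if (!str_equal(a->content, b->content)) return 0;
        if (!tree_equal(a->children, b->children)) return 0;
        a = a->next;
        b = b->next;
    }
    /* Equal only if both sibling lists ended at the same time. */
    return a == NULL && b == NULL;
}

/* Alternative 2: counts the nodes in a sibling list and everything
 * below it, as a cheap fingerprint. */
size_t tree_count(const Node *n) {
    size_t total = 0;
    for (; n != NULL; n = n->next)
        total += 1 + tree_count(n->children);
    return total;
}
```

Note that `tree_count` visits every node, while `tree_equal` stops at the first difference, so the count is only a useful shortcut when the "before" value was stored in advance and pointers into the old tree are no longer available.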


NOTES

Related questions

Warning for readers using the answered solutions

The problem is about comparing "before" with "after" a black-box operation, but there are two kinds of black boxes here:

  • Well-known and controllable ones, like XSLT transforms or the use of a known library. You must know that your black box will not change attribute order or ID content, denormalize spaces, etc.
  • Fully free ones, like the use of an external editor (e.g. an online editor changing an XHTML document), where the user and the software can do anything.

I will use a solution in the context of a "well-known" black box, so my comments in the "Details" section above remain valid.

In the context of "fully free" black boxes, you cannot use a "comparison of binary dumps", because only a canonical representation (C14N) is valid for comparison. To compare by C14N criteria, only the "Alternative solutions" (described above) are possible. For alternative 1 you must, among other things, sort each set of attribute nodes before comparing. For alternative 2 (also discussed here), you must generate the C14N dumps.
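As a rough illustration of the "sort before compare" step for attributes: the sketch below uses a hypothetical `Attr` list with plain string values (LibXML2's own `xmlAttr` stores its value as child nodes, and full C14N also takes namespaces into account, both ignored here). The helper name `attrs_equal_unordered` is made up, and names and values are assumed to be non-NULL.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified attribute list (not LibXML2's xmlAttr). */
typedef struct Attr {
    const char *name;    /* assumed non-NULL */
    const char *value;   /* assumed non-NULL */
    struct Attr *next;   /* next attribute on the same element */
} Attr;

/* qsort comparator: orders attribute pointers by attribute name. */
static int by_name(const void *x, const void *y) {
    const Attr *a = *(const Attr *const *)x;
    const Attr *b = *(const Attr *const *)y;
    return strcmp(a->name, b->name);
}

/* Copies the list into a newly allocated array sorted by name and
 * stores its length in *n. Returns NULL for an empty list or when the
 * allocation fails. */
static const Attr **sorted(const Attr *list, size_t *n) {
    size_t count = 0, i = 0;
    for (const Attr *p = list; p != NULL; p = p->next) count++;
    *n = count;
    if (count == 0) return NULL;
    const Attr **arr = malloc(count * sizeof *arr);
    if (arr == NULL) return NULL;
    for (const Attr *p = list; p != NULL; p = p->next) arr[i++] = p;
    qsort(arr, count, sizeof *arr, by_name);
    return arr;
}

/* Returns 1 if both lists hold the same name/value pairs regardless of
 * order, 0 if they differ, and -1 if memory could not be allocated. */
int attrs_equal_unordered(const Attr *a, const Attr *b) {
    size_t na, nb;
    const Attr **sa = sorted(a, &na);
    const Attr **sb = sorted(b, &nb);
    int result = 1;
    if (na != nb) {
        result = 0;
    } else if (na > 0 && (sa == NULL || sb == NULL)) {
        result = -1;
    } else {
        for (size_t i = 0; i < na; i++) {
            if (strcmp(sa[i]->name, sb[i]->name) != 0 ||
                strcmp(sa[i]->value, sb[i]->value) != 0) {
                result = 0;
                break;
            }
        }
    }
    free(sa);
    free(sb);
    return result;
}
```

Sorting by name gives a deterministic order because attribute names are unique within an element.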


PS: of course, use of the C14N criteria is subjective and depends on the application: if, e.g., for your application "changing attribute order" is a valid/important change, then a comparison that detects it is valid (!).

Peter Krauss
  • You do know that XML is a *text* format, so the binary representation would just be a sequence of characters in whatever encoding the XML is in. – Some programmer dude Jul 24 '14 at 17:42
  • @JoachimPileborg, yes, the nodes are "text nodes", but is there no (binary) tree representation? I think there is (!)... I do not see there (some graphic documentation?) the name of the C data structure for this *main tree*, which is distinct from an "XML-text dump". – Peter Krauss Jul 24 '14 at 17:44
  • 1
    Yes there is a binary representation, and if the XML is UTF-8 encoded the binary representation is a sequence of UTF-8 code points. What would the binary representation of e.g. `some data` be if not the binary representation of the characters making up the XML, in other words the characters themselves. – Some programmer dude Jul 24 '14 at 17:50
  • Or do you mean the actual in-memory structures that libxml2 uses? – Some programmer dude Jul 24 '14 at 17:51
  • You *do* know that libxml2 is open source, and that you can download the source from e.g. [here](http://xmlsoft.org/), and take a look at all the structures and all the code. It probably won't help you though, as the internal data structures are just that, internal. – Some programmer dude Jul 24 '14 at 17:56
  • Yes, I mean "in-memory structures"... Perhaps if you point to some documentation, I can confirm. E.g. [this random Google search](http://sunsite.ualberta.ca/Documentation/Misc/libxml2-2.4.10/html/structure.gif) has something about the parser's internal tree representation. – Peter Krauss Jul 24 '14 at 17:57
  • 1
    You can't do a byte-by-byte comparison of the internal representation of two DOM documents to see if they're the same. Two DOM documents created from the exact same XML document will compare differently because of various bits of data used in the internal representation (like pointers) are specific to the particular DOM document instance. – Ross Ridge Jul 24 '14 at 18:00
  • Yes, I can do that, but only to confirm whether the "in-memory structure" you point to is really what I expected... Rather than read a billion lines of code only to check my hunches, I need confirmation and clues from an expert... – Peter Krauss Jul 24 '14 at 18:00
  • You can't do what you stated as being your "last objective" in your question for the reason I gave. If what you're actually trying to do is different from that you should update your question with an explanation of what you're really trying to accomplish. – Ross Ridge Jul 24 '14 at 18:06
  • @RossRidge, yes, perhaps only some properties like size or number of nodes (this will be an alternative issue)... Or a faster analyzer (some simple tree-traversal algorithm for node-by-node comparison). About explaining, OK, I will copy here the text of http://stackoverflow.com/q/24586619/287948 – Peter Krauss Jul 24 '14 at 18:09
  • The assertion "can't do a byte-by-byte comparison of the internal representation of two DOM documents" **IS FALSE** in the context of this question; please see my new edit explaining that **it is the same DOM, before and after some black-box changes**. – Peter Krauss Jul 24 '14 at 18:44
  • I'm confident what I said was true given the context at the time it was made. In any case, it looks like you're on your own here as no one who's tried to help you seems to be as qualified to answer the question as you are. Once you do figure out the answer please remember to post it here so we can all benefit from it. – Ross Ridge Jul 24 '14 at 19:50
  • @RossRidge, sorry, that was not directed at you: it was written just after I made my edit and saw that someone had added "+1" to your comment, see the "1" there. – Peter Krauss Jul 24 '14 at 20:13

1 Answer


Here are the relevant libxml2 functions:

There is a base64 encoding method:

Function: xmlTextWriterWriteBase64

int xmlTextWriterWriteBase64    (xmlTextWriterPtr writer, 
                     const char * data, 
                     int start, 
                     int len)

Write a base64 encoded xml text.
writer: the xmlTextWriterPtr
data:   binary data
start:  the position within the data of the first byte to encode
len:    the number of bytes to encode
Returns:    the bytes written (may be 0 because of buffering) or -1 in case of error

and a BinHex encoding method:

Function: xmlTextWriterWriteBinHex
int xmlTextWriterWriteBinHex    (xmlTextWriterPtr writer, 
                     const char * data, 
                     int start, 
                     int len)

Write a BinHex encoded xml text.
writer: the xmlTextWriterPtr
data:   binary data
start:  the position within the data of the first byte to encode
len:    the number of bytes to encode
Returns:    the bytes written (may be 0 because of buffering) or -1 in case of error


Paul Sweatte