Test case code:

<?php
$str = file_get_contents('test string');
$dom = new DOMElement( 'test', $str );
var_dump( strlen($str) ); // Output int(10964)
var_dump( $dom->textContent ); // Output string(50) "O:33:"MWOps\MediaWiki\MediaWikiInstance":3:{s:46:""

The "test string" is a serialized object. At runtime I lose almost 11,000 bytes of data (only 50 of the 10,964 bytes survive), and I can't find the cause.

  • [This](https://bugs.php.net/bug.php?id=31191) may be relevant. It looks like you need to manually escape the string using `htmlspecialchars()` – Worthwelle Jul 17 '18 at 16:21
  • According to Notepad++, your test string also has NUL bytes in several places. Not sure whether that makes it a good idea to try and dump it into XML in the first place; also not sure whether you should first check the system where that data comes from ... seeing NUL bytes in what seem to be actual script installation paths would at least make me wonder if that's "alright" to begin with? (`{s:46:" [NUL]MWOps\MediaWiki\MediaWikiInstance[NUL] installPath"` etc.) – CBroe Jul 17 '18 at 16:29
  • @CBroe \0's are used as part of the encoding of private variables (see note in http://php.net/manual/en/function.serialize.php#refsect1-function.serialize-parameters) – Nigel Ren Jul 17 '18 at 16:32
  • @NigelRen thanks, wasn't aware of that! But probably still not a good idea to try and put those into XML without any prior treatment, I suppose? Especially if you would need to rely on being able to read them back from there properly, to be able to unserialize the data again by the same logic. – CBroe Jul 17 '18 at 16:37
  • It's common for programs to assume that a \0 marks the end of a string, so even if XML would accept it (not standard, and it should be encoded - https://stackoverflow.com/questions/19893/how-do-you-embed-binary-data-in-xml) some programs may truncate the data. – Nigel Ren Jul 17 '18 at 16:40
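To illustrate the point about private variables made in the comments above, here is a minimal sketch. The `Instance` class and its `$installPath` value are hypothetical stand-ins, not the asker's actual `MWOps\MediaWiki\MediaWikiInstance` class. It shows that `serialize()` wraps the class name of a private property in NUL bytes, and that the declared string length counts those bytes. This matches the 50-byte output in the question, which ends at `{s:46:"`, right where the first property name would begin.

```php
<?php
// Hypothetical class, used only to show the serialized form.
class Instance
{
    private $installPath = '/srv/wiki';
}

$str = serialize(new Instance());

// A private property name is stored as "\0ClassName\0propertyName".
// Here that is "\0Instance\0installPath": 1 + 8 + 1 + 11 = 21 bytes,
// which is why the serialized form declares it as s:21.
$firstNul = strpos($str, "\0");

var_dump($firstNul !== false);       // a NUL byte is present
var_dump(substr_count($str, "\0"));  // int(2), one on each side of the class name

// Everything before the first NUL; a NUL-terminated consumer would see only this.
var_dump(substr($str, 0, $firstNul));
```

Because the length prefix includes the NUL bytes, they cannot simply be stripped without also rewriting the lengths.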

1 Answer

Try this one - create a text node out of your string content first, and then append that to the document:

$dom = new DOMDocument();
$textNode = $dom->createTextNode($str);
$dom->appendChild($textNode);
var_dump( strlen($str) );
var_dump( $dom->textContent ); 

With your exact test data, this gives me an output of

int(10964)
string(10964) "..."

(This needs a DOMDocument, because only that has the createTextNode method. I did not create an additional test element here, but adding that in first and then appending the text node to it should work the same way.)


Whether you might need additional encoding of the NUL bytes or not, probably depends on how you read that data back later and for what purpose.

CBroe
  • Your code gives a different result on my computer. See [screenshot](https://drive.google.com/open?id=1zKy4UaLoHIxTFjlRXGygqJBO2VjMXkNC). My environment is Windows 7 and PHP 7.2.0. – RazeSoldier Jul 18 '18 at 07:04
  • Hm, then it is probably specific to the PHP version. Have you tried any escaping/encoding yet? What do you eventually need this file for, who/what is going to process it? – CBroe Jul 18 '18 at 07:09
  • I tried `htmlspecialchars( $str, ENT_XML1 )` before adding the string to XML, but it had no effect. I still get a 50 byte string. – RazeSoldier Jul 18 '18 at 11:14
  • Have you checked what the generated XML document contains, not just its length? Where exactly does it “stop”/break off the content - at the first NUL byte, or somewhere else? And again, what do you eventually need this for? I am asking to try and figure out whether those NUL bytes could be removed/escaped in any way perhaps, without breaking whatever functionality will eventually try to use this data. – CBroe Jul 18 '18 at 11:21
  • Have you tried putting the content explicitly into a CDATA section maybe? http://php.net/manual/en/domdocument.createcdatasection.php – CBroe Jul 18 '18 at 11:22
  • I plan to store this serialized object in XML, then read the string back, unserialize it and use it. – RazeSoldier Jul 18 '18 at 11:28
  • I also tried a CDATA section and another extension (XMLWriter), but still to no effect. – RazeSoldier Jul 18 '18 at 11:29
  • Have you tested whether unserializing _needs_ those NUL bytes? Maybe they could be removed to begin with, without disturbing that. – CBroe Jul 18 '18 at 11:39
  • `$str = file_get_contents( 'test string' ); $str = preg_replace('/\0/', null, $str); var_dump( unserialize($str) );` gives `Notice: unserialize(): Error at offset 96 of 10950 bytes` – RazeSoldier Jul 18 '18 at 11:44
  • I tried encoding the serialized object with base64, and storing it that way succeeded. It seems that serialized objects cannot be stored directly in XML. – RazeSoldier Jul 18 '18 at 11:50
  • Hm, I think my test yesterday was faulty, probably due to copy&paste of the input data. No matter which way I try now, it always seems to cut off at the first NUL byte - so perhaps it is indeed some sort of C-like “this has got to be the end of the string” issue within the DOMDoc implementation itself. Maybe you could try and replace the NUL bytes with `` first - but then you’d have to pay attention not to get that “double-encoded” somewhere along the way. – CBroe Jul 18 '18 at 11:53
  • Last alternative I could think of - you split your original input at those NUL bytes, then loop over the resulting parts, append each of those as a new text node, and put a text node containing nothing but `` in between each, so that in the end the result should again be equivalent to the original data (after it is read back from the XML, during which `` should be translated back to NUL again ... hopefully?) – CBroe Jul 18 '18 at 11:55
  • It seems complicated to implement, I prefer a simple base 64 encoding. – RazeSoldier Jul 18 '18 at 11:59
  • Well if that works for the receiving end as well, resp. you can modify that to decode the base64 there again … then it’s probably the easiest option. – CBroe Jul 18 '18 at 12:01
  • I didn't expect XML to dislike NUL. Thank you for your help – RazeSoldier Jul 18 '18 at 12:05
  • It’s probably not the fault of XML in itself, but an error in the implementation by PHP - PHP is written in C itself, so how _that_ treats NUL bytes in strings (as meaning “string ends here”) likely affects the outcome here. – CBroe Jul 18 '18 at 12:09
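The base64 approach the asker settled on can be sketched as follows. This is a minimal illustration under stated assumptions, not the asker's actual code: the `Instance` class, the `/srv/wiki` value and the `test` element name are hypothetical stand-ins. Base64 output contains no NUL bytes and no characters that need XML escaping, so the text node holds the complete payload.

```php
<?php
// Hypothetical class standing in for the serialized object in the question.
class Instance
{
    private $installPath = '/srv/wiki';
}

$serialized = serialize(new Instance());

// Base64 uses only A-Z, a-z, 0-9, "+", "/" and "=", so the encoded
// string has no NUL bytes for the DOM extension to stumble over.
$dom = new DOMDocument('1.0', 'UTF-8');
$element = $dom->createElement('test');
$element->appendChild($dom->createTextNode(base64_encode($serialized)));
$dom->appendChild($element);
$xml = $dom->saveXML();

// Read the XML back and decode before unserializing.
$loaded = new DOMDocument();
$loaded->loadXML($xml);
$decoded = base64_decode($loaded->documentElement->textContent, true);

var_dump($decoded === $serialized);

// Restricting allowed_classes is a precaution added for this sketch.
$restored = unserialize($decoded, ['allowed_classes' => [Instance::class]]);
var_dump($restored instanceof Instance);
```

The trade-off is size: base64 grows the payload by roughly a third, and whatever reads the XML has to know it must decode the element's content first.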