I need to recursively turn a node of an XML document into a JSON string. For the most part I have it working:

$sku = "AC2061414";
$dom = new SimpleXMLElement(file_get_contents( "/usr/share//all_products.xml" )); 
$query = '//sku[text() = "'.$sku.'"]';
$entries = $dom->xpath($query);

foreach ($entries as $entry) {

    $parent_div = $entry->xpath( 'parent::*' );
    $nodearray=array();

    foreach($parent_div as $node) {
        if ($node->nodeType == XML_CDATA_SECTION_NODE) {
            $nodearray[$node->getName()]=$node->textContent;
        }else{
            $nodearray[$node->getName()]=$node;
        }
    }
    $ajax = json_encode( $nodearray );
    print($ajax);
}

Run on

<?xml version="1.0" encoding="UTF-8"?>
<products>
   <product active="1" on_sale="0" discountable="1">
    <sku>AC2061414</sku>
    <name><![CDATA[ALOE CADABRA ORGANIC LUBE PINA COLADA 2.5OZ]]></name>
    <description><![CDATA[ text text ]]></description>
    <keywords/>
    <price>7.45</price>
    <stock_quantity>30</stock_quantity>
    <reorder_quantity>0</reorder_quantity>
    <height>5.25</height>
    <length>2.25</length>
    <diameter>0</diameter>
    <weight>0.27</weight>
    <color></color>
    <material>aloe vera, vitamin E</material>
    <barcode>826804006358</barcode>
    <release_date>2012-07-26</release_date>
    <images>
      <image>/AC2061414/AC2061414A.jpg</image>
    </images>
    <categories>
      <category code="528" video="0" parent="0">Lubricants</category>
      <category code="531" video="0" parent="528">Flavored</category>
      <category code="28" video="0" parent="25">Oral Products</category>
      <category code="532" video="0" parent="528">Natural</category>
    </categories>
    <manufacturer code="AC" video="0">Aloe Cadabra Lubes</manufacturer>
    <type code="LU" video="0">Lubes</type>
  </product>
</products>

And it produces

{"product":{"@attributes":{"active":"1","on_sale":"0","discountable":"1"},"sku":"AC2061414","name":{},"description":{},"keywords":{},"price":"7.45","stock_quantity":"30","reorder_quantity":"0","height":"5.25","length":"2.25","diameter":"0","weight":"0.27","color":{},"material":"aloe vera, vitamin E","barcode":"826804006358","release_date":"2012-07-26","images":{"image":"\/AC2061414\/AC2061414A.jpg"},"categories":{"category":["Lubricants","Flavored","Oral Products","Natural"]},"manufacturer":"Aloe Cadabra Lubes","type":"Lubes"}}

This seems OK except for the missing values of the nodes that contained CDATA. I did try to account for them, but it is not working. What is the trick here?

Quantum
  • I know it's not really what you're asking, but why are you translating XML into JSON anyway? Why not just serialize the node as XML and parse that in whatever the next stage of processing is? – IMSoP Jul 08 '13 at 18:43
  • @IMSoP the short of it is that it's for a temporary view system that is loaded via Ajax, and since the XML is 40 MB it's faster to create mini JSON files for use later down the line... basically, as odd as it seems, it simplifies things in the overall scope. It's more project-specific here, so I didn't run down the whole process, just the part I needed. – Quantum Jul 08 '13 at 20:55
  • @jeremyBass_DC Fair enough. You could still create mini-XML files rather than mini-JSON, though - just thinking that the `'@attributes'` key effectively ties you to SimpleXML anyway. – IMSoP Jul 09 '13 at 13:05
  • @IMSoP yes, you are right that I could have turned them into mini-XML packages, but JavaScript would rather I send JSON, so why not do that step while the XML is already in memory, versus the extra IO and whatnot of pushing it down the road? Now I can just serve the string. It's a diffusion of work and really a matter of preference here, but for the normal person coming to this question, the important part is the LIBXML_NOCDATA flag. Thank you for the thought – Quantum Jul 09 '13 at 14:06

2 Answers

You can try adding the LIBXML_NOCDATA option to the constructor.

$dom = new SimpleXMLElement(file_get_contents( "/usr/share//all_products.xml" ), LIBXML_NOCDATA);
...

More details here.

subroutines

The problem here comes from json_encode: it treats the SimpleXMLElement objects according to their magic interfaces (see how @attributes is serialized, for example). It also skips all child CDATA nodes, because they are dropped when an element's value is read in magic mode (compare the print_r and var_dump output of a SimpleXMLElement).
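As a small illustration of that point, the CDATA text is still present in the document even without any option; only the property view used by print_r and json_encode leaves it out. A minimal sketch, assuming a cut-down inline document (the `$product` and `$name` variables are made up for this example), reads the value with an explicit string cast:

```php
<?php
// Hypothetical, cut-down sample; no LIBXML_NOCDATA is passed here,
// so the CDATA section stays a CDATA node in the parsed document.
$product = simplexml_load_string(
    '<product><name><![CDATA[ text text ]]></name></product>'
);

// An explicit string cast returns the element's text content,
// including text that lives inside a CDATA section.
$name = (string) $product->name;

var_dump($name);
```

This is the route to take if you build the array by hand, as the question's loop does; the option described next avoids the manual loop altogether.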

Because CDATA nodes can be normalized into surrounding text or into plain text nodes, SimpleXML offers the LIBXML_NOCDATA option (on instantiation with new or with the simplexml_load_* functions) to do exactly this: turn CDATA nodes into text nodes and merge them into the surrounding text nodes, if any ("Merge CDATA as text nodes").

That makes print_r and json_encode return the node value as a string, because the former CDATA content now is the node value. This has been explained in detail in "PHP, SimpleXML, decoding entities in CDATA".
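To see the difference side by side, here is a minimal sketch that parses the same inline snippet twice, once without options and once with LIBXML_NOCDATA. The sample XML is a shortened version of the question's product, and the `$plain` and `$merged` names are only illustrative:

```php
<?php
// Shortened copy of the <product> element from the question.
$xmlString = <<<XML
<product active="1">
  <sku>AC2061414</sku>
  <name><![CDATA[ALOE CADABRA ORGANIC LUBE PINA COLADA 2.5OZ]]></name>
</product>
XML;

// Default parsing: CDATA sections are kept as CDATA nodes.
$plain = simplexml_load_string($xmlString);

// LIBXML_NOCDATA: CDATA sections are merged in as ordinary text nodes.
$merged = simplexml_load_string($xmlString, SimpleXMLElement::class, LIBXML_NOCDATA);

echo json_encode($plain), "\n";   // compare the "name" key with the question's output
echo json_encode($merged), "\n";  // here "name" carries the CDATA text as a string
```

Only the second call should show the product name as a plain string; the first reproduces the empty value seen in the question.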

Beyond this, there is another misunderstanding, and fixing it will benefit you greatly. Your code already contains the XPath to select an element by its text value, but what you are really interested in is its parent. SimpleXML already offers iteration over all children, and the same goes for the magic properties of SimpleXML that json_encode uses. Compare how this allows you to reduce the code:

$xml = simplexml_load_file("/usr/share/all_products.xml", NULL, LIBXML_NOCDATA); 

// NOTE: Prevent XPath Injection by not allowing " (or ') for 
//       SKU value (validate it against a whitelist of allowed
//       characters for example)
$sku   = "AC2061414";
$query = sprintf('(//sku[text() = "%s"])[1]/..', $sku); 

$products = $xml->xpath($query);

if ($products) {
    echo json_encode(["product" => $products[0]]);
}

See the Demo.

This should give you the same output without writing nearly as much code. Note the LIBXML_NOCDATA option when creating the SimpleXMLElement, as well as the modified XPath query, which directly selects the parent (<product>) node of the (first) sku element in question. json_encode then takes care of all children through its usual traversal of the magic properties SimpleXML provides.
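The note in the code about XPath injection can be sketched as follows. This is only one possible approach: the `buildSkuQuery` helper and the letters-and-digits whitelist are assumptions made for illustration, so adjust the pattern to whatever characters your real SKUs may contain.

```php
<?php
// Hypothetical helper: validates the SKU against a whitelist before
// interpolating it into the XPath expression used above.
function buildSkuQuery(string $sku): string
{
    // Assumed whitelist: ASCII letters and digits only. Quotes can never
    // pass, so the value cannot break out of the XPath string literal.
    if (!preg_match('/^[A-Za-z0-9]+$/D', $sku)) {
        throw new InvalidArgumentException('Invalid SKU: ' . $sku);
    }

    return sprintf('(//sku[text() = "%s"])[1]/..', $sku);
}

echo buildSkuQuery('AC2061414'), "\n"; // (//sku[text() = "AC2061414"])[1]/..
```

Rejecting the input outright is simpler than trying to escape it, since XPath 1.0 string literals have no escape sequence for quotes.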

hakre
  • Very nice expanded explanation here, and yes, I did know that I had expanded out the traversal of the nodes, but I couldn't think of how else to add a condition based on the child node in order to isolate the root of the issue. I was not far off what you had there, but for sure that is a better way to go. I will leave the first answer as the accepted one since it's at the heart of this, but this is by far more informative to the reader, thank you. – Quantum Jul 08 '13 at 14:01
  • Thanks for the comment, I somewhat thought so. With XML one needs to wrap one's head around the hierarchy, and with JSON this is similar. It was interesting to fiddle with yours here, and I was also able to dig up some existing information which is hopefully useful. There is one thing I was wondering about a little: for the element that goes into json_encode, the @attributes property is created, but only for this one, not for the children. I was not sure if this was also part of your issue or not. – hakre Jul 09 '13 at 09:55
  • I wouldn't mind having the `@attributes` persist, to be honest. It's not do or die in this case, but looking ahead, it is odd that it is dropping them. A little buggy there, I believe, as it should handle that recursively, IMHO at least. If you have ideas, I would welcome them for sure. Thank you – Quantum Jul 09 '13 at 14:10
  • Yes, so far I haven't found anything that explains well why this happens. As non-obvious as this looks, there might be a good reason for it. I have no good clue so far, but I might pick up these loose ends later. If you're looking at how to [control json serialization more with simplexml, I have once covered this more generally in another answer](http://stackoverflow.com/a/16938322/367456). It is probably good for you to know that this is possible with PHP. – hakre Jul 09 '13 at 14:15
  • About @attributes and the different handling of the first vs. the traversed elements: I gave it a write-up, and I think I could put some sense into it: [SimpleXML and JSON Encode in PHP – Part I](http://hakre.wordpress.com/2013/07/09/simplexml-and-json-encode-in-php-part-i/). It looks like this is a trade-off / compromise. I plan to write a second part showing what I linked in the other answer. – hakre Jul 09 '13 at 16:28
  • Nice write-up, and yes, that is an excellent idea. I was just going to traverse the JSON object, but cutting it down and making it a little nicer would be grand, I think. I will be looking deeper into this when I get back in :D – Quantum Jul 09 '13 at 16:58