5

I'm trying to convert some XML to JSON, which is easy enough with PHP:

$file = file_get_contents('data.xml' );
$a = json_decode(json_encode((array) simplexml_load_string($file)),1);
print_r($a);

Taking the following XML

<?xml version="1.0" encoding="UTF-8"?>
<foo>
    <bar>
        <one lang="fr" type="bar">Test</one>
        <one lang="fr" type="foo">Test</one>
        <one lang="fr" type="baz">Test</one>
    </bar>

    <thunk>
        <thud>
            <bar lang="fr" name="bob">test</bar>
            <bar lang="bz" name="frank">test</bar>
            <bar lang="ar" name="alive">test</bar>
            <bar lang="fr" name="bob">test</bar>
        </thud>
    </thunk>

</foo>

And parsing it with SimpleXML produces:

Array
(
    [bar] => Array
        (
            [one] => Array
                (
                    [0] => Test
                    [1] => Test
                    [2] => Test
                )

        )

    [thunk] => Array
        (
            [thud] => Array
                (
                    [bar] => Array
                        (
                            [0] => test
                            [1] => test
                            [2] => test
                            [3] => test
                        )

                )

        )

)

Where ideally the output would look like this

{
    "foo": {
        "bar": {
            "one": [
                {
                    "_lang": "fr",
                    "_type": "bar",
                    "__text": "Test"
                },
                {
                    "_lang": "fr",
                    "_type": "foo",
                    "__text": "Test"
                },
                {
                    "_lang": "fr",
                    "_type": "baz",
                    "__text": "Test"
                }
            ]
        },
        "thunk": {
            "thud": {
                "bar": [
                    {
                        "_lang": "fr",
                        "_name": "bob",
                        "__text": "test"
                    },
                    {
                        "_lang": "bz",
                        "_name": "frank",
                        "__text": "test"
                    },
                    {
                        "_lang": "ar",
                        "_name": "alive",
                        "__text": "test"
                    },
                    {
                        "_lang": "fr",
                        "_name": "bob",
                        "__text": "test"
                    }
                ]
            }
        }
    }
}

The trouble is that the output doesn't contain the attributes of the child elements, and some of these elements have two or more attributes. Is there a way to transform the XML with PHP or Python and include all the attributes found on all the children?

Thanks

user2988129

3 Answers

12

In my answer I'll cover PHP, specifically SimpleXMLElement which is already part of your code.

The basic way to JSON encode XML with SimpleXMLElement is similar to what you have in your question. You instantiate the XML object and then you json_encode it (Demo):

$xml = new SimpleXMLElement($buffer);
echo json_encode($xml, JSON_PRETTY_PRINT);

This already produces output close to, but not exactly like, what you're looking for. What you need to do with SimpleXML is change the default way json_encode encodes the XML object.

This can be done with a new subtype of SimpleXMLElement that implements the JsonSerializable interface. Here is such a class, which reproduces the default way PHP would JSON-serialize the object:

class JsonSerializer extends SimpleXMLElement implements JsonSerializable
{
    /**
     * SimpleXMLElement JSON serialization
     *
     * @return array
     *
     * @link http://php.net/JsonSerializable.jsonSerialize
     * @see JsonSerializable::jsonSerialize
     */
    function jsonSerialize()
    {
        // the array cast mirrors what json_encode does with the object by default
        return (array) $this;
    }
}

Using it will produce the exact same output (Demo):

$xml = new JsonSerializer($buffer);
echo json_encode($xml, JSON_PRETTY_PRINT);

Now comes the interesting part: changing just those bits of the serialization that are needed to get your output.

First of all, you need to distinguish between an element that carries other elements (has children) and a leaf element, of which you want the attributes and the text value:

    if (count($this)) {
        // serialize children if there are children
        ...
    } else {
        // serialize attributes and text for a leaf element
        foreach ($this->attributes() as $name => $value) {
            $array["_$name"] = (string) $value;
        }
        $array["__text"] = (string) $this;
    }

That's done with this if/else. The if-block is for the children and the else-block for the leaf-elements. As the leaf-elements are easier, I've kept them in the example above. As you can see in the else-block it iterates over all attributes and adds those by their name prefixed with "_" and finally the "__text" entry by casting to string.

The handling of the children is a bit more convoluted, as you need to distinguish between a single child element with its name only and multiple children with the same name, which require an additional array inside:

        // serialize children if there are children
        foreach ($this as $tag => $child) {
            // child is a single-named element -or- child are multiple elements with the same name - needs array
            if (count($child) > 1) {
                $child = [$child->children()->getName() => iterator_to_array($child, false)];
            }
            $array[$tag] = $child;
        }

Now there is another special case the serialization needs to deal with: you encode the root element name. So the routine needs to check for that condition (being the so-called document element; compare with the SimpleXML Type Cheatsheet) and, when it applies, serialize under that name:

    if ($this->xpath('/*') == array($this)) {
        // the root element needs to be named
        $array = [$this->getName() => $array];
    }

Finally, all that's left to do is return the array:

    return $array;

Compiled together, this is a JsonSerializer done in SimpleXML, tailored to your needs. Here are the class and its invocation at once:

class JsonSerializer extends SimpleXMLElement implements JsonSerializable
{
    /**
     * SimpleXMLElement JSON serialization
     *
     * @return array
     *
     * @link http://php.net/JsonSerializable.jsonSerialize
     * @see JsonSerializable::jsonSerialize
     */
    function jsonSerialize()
    {
        $array = [];

        if (count($this)) {
            // serialize children if there are children
            foreach ($this as $tag => $child) {
                // child is a single-named element -or- child are multiple elements with the same name - needs array
                if (count($child) > 1) {
                    $child = [$child->children()->getName() => iterator_to_array($child, false)];
                }
                $array[$tag] = $child;
            }
        } else {
            // serialize attributes and text for a leaf element
            foreach ($this->attributes() as $name => $value) {
                $array["_$name"] = (string) $value;
            }
            $array["__text"] = (string) $this;
        }

        if ($this->xpath('/*') == array($this)) {
            // the root element needs to be named
            $array = [$this->getName() => $array];
        }

        return $array;
    }
}

$xml = new JsonSerializer($buffer);
echo json_encode($xml, JSON_PRETTY_PRINT);

Output (Demo):

{
    "foo": {
        "bar": {
            "one": [
                {
                    "_lang": "fr",
                    "_type": "bar",
                    "__text": "Test"
                },
                {
                    "_lang": "fr",
                    "_type": "foo",
                    "__text": "Test"
                },
                {
                    "_lang": "fr",
                    "_type": "baz",
                    "__text": "Test"
                }
            ]
        },
        "thunk": {
            "thud": {
                "bar": [
                    {
                        "_lang": "fr",
                        "_name": "bob",
                        "__text": "test"
                    },
                    {
                        "_lang": "bz",
                        "_name": "frank",
                        "__text": "test"
                    },
                    {
                        "_lang": "ar",
                        "_name": "alive",
                        "__text": "test"
                    },
                    {
                        "_lang": "fr",
                        "_name": "bob",
                        "__text": "test"
                    }
                ]
            }
        }
    }
}

I hope this was helpful. It's perhaps a little much at once. The JsonSerializable interface is documented in the PHP manual as well, and you can find more examples there. Another example of this kind of XML to JSON conversion can be found here on Stack Overflow: XML to JSON conversion in PHP SimpleXML.

hakre
7

I expanded on the answer by hakre. This version differentiates multiple children better and includes the attributes from the entire chain, except for the root element.

/**
 * Class JsonSerializer
 */
class JsonSerializer extends SimpleXMLElement implements JsonSerializable
{
    const ATTRIBUTE_INDEX = "@attr";
    const CONTENT_NAME = "_text";

    /**
     * SimpleXMLElement JSON serialization
     *
     * @return array
     *
     * @link http://php.net/JsonSerializable.jsonSerialize
     * @see JsonSerializable::jsonSerialize
     * @see https://stackoverflow.com/a/31276221/36175
     */
    function jsonSerialize()
    {
        $array = [];

        if ($this->count()) {
            // serialize children if there are children
            /**
             * @var string $tag
             * @var JsonSerializer $child
             */
            foreach ($this as $tag => $child) {
                $temp = $child->jsonSerialize();
                $attributes = [];

                foreach ($child->attributes() as $name => $value) {
                    $attributes["$name"] = (string) $value;
                }

                $array[$tag][] = array_merge($temp, [self::ATTRIBUTE_INDEX => $attributes]);
            }
        } else {
            // serialize attributes and text for a leaf-elements
            $temp = (string) $this;

            // if only contains empty string, it is actually an empty element
            if (trim($temp) !== "") {
                $array[self::CONTENT_NAME] = $temp;
            }
        }

        if ($this->xpath('/*') == array($this)) {
            // the root element needs to be named
            $array = [$this->getName() => $array];
        }

        return $array;
    }
}
OIS
-1

You can use the lxml library for Python.

It's a powerful tool that lets you read the attributes of every element.
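As a rough sketch of what that could look like, the following uses the standard library's xml.etree.ElementTree rather than lxml itself; lxml.etree is designed to be largely compatible with that API, but that swap is an assumption to check against your setup. The function names element_to_dict and xml_to_json are made up for this illustration. Attributes are prefixed with "_" and text is stored under "__text", following the target JSON in the question.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Convert one element into a dict, keeping its attributes and text."""
    # Attributes become keys prefixed with "_", as in the question's target JSON.
    node = {"_" + name: value for name, value in element.attrib.items()}

    children = list(element)
    if children:
        for child in children:
            converted = element_to_dict(child)
            if child.tag in node:
                # Repeated tag name: collect the siblings in a list.
                existing = node[child.tag]
                if not isinstance(existing, list):
                    existing = [existing]
                    node[child.tag] = existing
                existing.append(converted)
            else:
                node[child.tag] = converted
    else:
        # Leaf element: keep its text under "__text".
        node["__text"] = (element.text or "").strip()
    return node


def xml_to_json(xml_text):
    """Parse an XML string and return JSON with the root element named."""
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: element_to_dict(root)}, indent=4)


if __name__ == "__main__":
    sample = (
        '<foo><bar>'
        '<one lang="fr" type="bar">Test</one>'
        '<one lang="fr" type="foo">Test</one>'
        '</bar></foo>'
    )
    print(xml_to_json(sample))
```

Note that a tag which occurs only once under its parent stays a plain object, while repeated tags become a list, which matches the shape asked for in the question. Text mixed in between child elements is ignored in this sketch.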

g_let
  • Whilst this may theoretically answer the question, [it would be preferable](//meta.stackoverflow.com/q/8259) to include the essential parts of the answer here, and provide the link for reference. – OhBeWise Jul 07 '15 at 21:19