I am trying to convert a Word-generated XML file to JSON through PHP.

I have looked around and found that the approach most often recommended for XML files in general (it even appears in the PHP documentation) is the following code:

$xml = simplexml_load_string($xml_string);
$json = json_encode($xml);
$array = json_decode($json,TRUE);

The problem is that simplexml_load_string gives me what looks like an empty SimpleXMLElement object, so the remaining steps have nothing to work with. The XML itself begins with:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:wordDocument 

and the tags have a w: prefix. I have tried removing the w: prefixes, but the function still returns an empty object. Any idea what I might be missing? Is there anything special about this type of generated XML?

kjhughes
Kaloyan Dimov
    Short answer: Don't. Word is a complex format with mixed nodes. Parsing it into SimpleXML objects and serializing them will not work (you lose too much information), and any JSON structure that keeps that information will be as complex as the original XML. Try reading the specific information you need from the XML using DOM and XPath (better handling of namespaces and mixed nodes) and build an array/object structure from it. Then encode that generated structure. – ThW Jun 06 '19 at 10:12
    I wouldn't say don't. [This Python package](https://github.com/microsoft/Simplify-Docx) does just that. You'd have to pass the document from PHP on to a Python server (aiohttp, Django, etc.), but [that isn't impossible](https://www.docx2json.com/) – Jthorpe Feb 11 '21 at 07:44

2 Answers

Check out this question: "Simplexml_load_string($string) returns an empty object but $string contains xml? code below"

It's pretty similar.

Could you try to print $xml? Maybe the problem isn't in simplexml_load_string but in json_encode.
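To make that check concrete, here is a minimal sketch. It uses a hypothetical, heavily trimmed stand-in for the Word export: the sample markup and the namespace URI on the root element are assumptions, not taken from the question's file. It illustrates that SimpleXML only exposes elements outside any namespace by default, so a document made entirely of w: elements can look empty even though it parsed correctly. Passing the prefix to SimpleXMLElement::children() reaches them.

```php
<?php
// Hypothetical, trimmed sample; the namespace URI is an assumption
// (it is the one Word 2003 XML is generally known to use).
$xml_string = <<<'XML'
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p><w:r><w:t>Hello</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>
XML;

// Collect parser errors instead of emitting warnings.
libxml_use_internal_errors(true);

$xml = simplexml_load_string($xml_string);
if ($xml === false) {
    // A real parse failure returns false, not an "empty" object.
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), PHP_EOL;
    }
    exit(1);
}

// Children outside any namespace: none in this document.
$plain = count($xml->children());

// Children bound to the "w" prefix (second argument true = treat as prefix).
$prefixed = count($xml->children('w', true));

// Walk w:body -> w:p -> w:r -> w:t, asking for the "w" children at each level.
$texts = [];
$body = $xml->children('w', true)->body;
foreach ($body->children('w', true) as $p) {
    foreach ($p->children('w', true) as $r) {
        foreach ($r->children('w', true) as $t) {
            $texts[] = (string) $t;
        }
    }
}

echo "no-namespace children: $plain", PHP_EOL;
echo "w: children: $prefixed", PHP_EOL;
echo implode(' ', $texts), PHP_EOL;
```

If the parse succeeds and only the prefixed count is non-zero, the object is not really empty; it is just that print_r and json_encode do not see namespaced elements.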

BR Marc

M. Schröder

@ThW is correct: Don't convert OOXML to JSON. It won't help.

The complexity of OOXML (the standard behind DOCX) will not be tamed by conversion to JSON. A successful JSON conversion would be challenging and would only really serve to provide appreciation of the general advice to use XML for documents and JSON for data.

See also "JSON or XML? Which is better?" and note:

  • OOXML is an existing, highly complex standard for documents, not data.
  • Existing OOXML tool infrastructure is 100% XML-based.
  • Representing documents requires representation of mixed content, something JSON is not designed to do.¹

¹ Somewhat ironically, mixed content is rarely used in OOXML: Runs of text are generally wrapped within w:r/w:t elements. If you're looking for inspiration that a JSON-based DOCX representation would be possible, this is it. If you're looking to understand how JSON wouldn't tame the DOCX complexity, this should also help. :-)
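As an illustration of the w:r/w:t point, and of the DOM and XPath route suggested in the comments, here is a rough sketch that pulls only paragraph text out of a document and encodes that small structure as JSON. The sample markup, the namespace URI, and the "paragraphs" key are all assumptions made for the example; a real Word file carries far more structure (styles, tables, sections) that this deliberately ignores.

```php
<?php
// Hypothetical, trimmed sample; the namespace URI is an assumption.
$xml_string = <<<'XML'
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <w:r><w:t>Hello </w:t></w:r>
      <w:r><w:t>world</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>Second paragraph</w:t></w:r>
    </w:p>
  </w:body>
</w:wordDocument>
XML;

$doc = new DOMDocument();
if (!$doc->loadXML($xml_string)) {
    exit(1); // not well-formed XML
}

$xpath = new DOMXPath($doc);
// Bind the "w" prefix to whatever namespace the root element actually uses,
// so the URI does not have to be hard-coded here.
$xpath->registerNamespace('w', $doc->documentElement->namespaceURI);

$paragraphs = [];
foreach ($xpath->query('//w:p') as $p) {
    // A paragraph's text is split across runs (w:r), each holding w:t nodes.
    $text = '';
    foreach ($xpath->query('.//w:r/w:t', $p) as $t) {
        $text .= $t->textContent;
    }
    $paragraphs[] = $text;
}

// Encode only the structure that was built, not the whole document.
$json = json_encode(['paragraphs' => $paragraphs]);
echo $json, PHP_EOL;
```

This keeps the JSON small because it only contains what was explicitly selected; anything else in the document would need its own query and its own place in the array.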

kjhughes