1

I have a set of XLSX files that PhpSpreadsheet cannot load, because simplexml_load_string returns an empty SimpleXMLelement from (for instance) the workbook XML file.

The file has the following format, that can be loaded by simplexml after removing all occurrences of the x: namespace, and the declaration itself (that is, for instance, the <x:workbook> tag has been converted to <workbook>).

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<x:workbook xmlns:x15ac="http://schemas.microsoft.com/office/spreadsheetml/2010/11/ac" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" xmlns:xr6="http://schemas.microsoft.com/office/spreadsheetml/2016/revision6" xmlns:xr10="http://schemas.microsoft.com/office/spreadsheetml/2016/revision10" xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2" mc:Ignorable="x15 xr xr6 xr10 xr2" xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <x:fileVersion appName="xl" lastEdited="7" lowestEdited="4" rupBuild="23801" />
  <x:workbookPr codeName="ThisWorkbook" />
  <mc:AlternateContent xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006">
    <mc:Choice Requires="x15">
      <x15ac:absPath xmlns:x15ac="http://schemas.microsoft.com/office/spreadsheetml/2010/11/ac" url=".........." />
    </mc:Choice>
  </mc:AlternateContent>
  <xr:revisionPtr revIDLastSave="0" documentId=".........." xr6:coauthVersionLast="46" xr6:coauthVersionMax="46" xr10:uidLastSave="{00000000-0000-0000-0000-000000000000}" />
  <x:bookViews>
    <x:workbookView xWindow="-120" yWindow="-120" windowWidth="29040" windowHeight="15840" xr2:uid="{00000000-000D-0000-FFFF-FFFF00000000}" />
  </x:bookViews>
  <x:sheets>
    <x:sheet name="......" sheetId="1" r:id="rId1" />
  </x:sheets>
  <x:calcPr calcId="191029" />
</x:workbook>

I'm not sure the XML file is wrong, since the XLSX file(s) can be opened - for instance - with Libre Office. Anyway, have managed to load the file(s) hacking a simple minded function cleanup_xml() in Xlsx.php:

                    //~ http://schemas.openxmlformats.org/spreadsheetml/2006/main"
                    $xmlWorkbook = simplexml_load_string(
                      cleanup_xml($this->securityScanner->scan($this->getFromZipArchive($zip, "{$rel['Target']}"))),
                        'SimpleXMLElement',
                        Settings::getLibXmlLoaderOptions()
                    );

Maybe there is a proper/clean way to force simplexml API to load such files ?

edit:

I was wrong thinking all problems were gone after the cleanup_xml hack. Seems that also the data rows XML file has problems, probably the same as above...

edit:

Indeed, I moved cleanup_xml() into XmlScanner::scan, to apply to every loaded XML, and now seems to work...

edit:

Seems the namespace declaration is correct, at least, from this simple example...

Then, I wonder why simplexml_load_string doesn't accept the format:

<x:workbook ... xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
....
</x:workbook>

while it apparently accepts

<workbook ... xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
....
<workbook>

edit

Have digged into simplexml API, this answer helped to understand the problem. Now I can try to rewrite my hackish cleanup_xml accounting for namespaces... Just wondering if PhpSpreadsheet offers a better way... seems strange this problem has been unnoticed before...

edit

ok, now I've found the bug report...

IMSoP
  • 89,526
  • 13
  • 117
  • 169
CapelliC
  • 59,646
  • 5
  • 47
  • 90

1 Answers1

1

This appears to be a bug in PhpSpreadsheet.

Opening an XLSX file I created this week with a real copy of Microsoft Excel, the "workbook.xml" starts like this:

<workbook
 xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" 
 xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
 xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" 
 mc:Ignorable="x15 xr xr6 xr10 xr2"
 xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main" 
 xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" 
 xmlns:xr6="http://schemas.microsoft.com/office/spreadsheetml/2016/revision6" 
 xmlns:xr10="http://schemas.microsoft.com/office/spreadsheetml/2016/revision10"
 xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2">

This declares eight different namespaces that will be used in the document. One happens to be defined as the "default namespace", and the other seven are assigned prefixes - but all of that is just local to this specific file.

If we look at your XML document, we can see all the same namespaces in use, plus an extra one:

<x:workbook
 xmlns:x15ac="http://schemas.microsoft.com/office/spreadsheetml/2010/11/ac"
 xmlns:r="http://schemas.openxmlformats.org/officeDocumen/2006/relationships"
 xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
 xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main"
 xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision"
 xmlns:xr6="http://schemas.microsoft.com/office/spreadsheetml/2016/revision6"
 xmlns:xr10="http://schemas.microsoft.com/office/spreadsheetml2016/revision10"
 xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2"
 mc:Ignorable="x15 xr xr6 xr10 xr2"
 xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">

The only difference is that the namespace "http://schemas.openxmlformats.org/spreadsheetml/2006/main" has been assigned prefix "x", rather than set as the default namespace, but that makes no difference to its meaning. A different library might label the namespaces completely differently, just because of the way it generates the XML:

<ns0:workbook
 xmlns:ns0="http://schemas.openxmlformats.org/spreadsheetml/2006/main" 
 xmlns:ms1="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
 xmlns:ns2="http://schemas.openxmlformats.org/markup-compatibility/2006" 
 ns2:Ignorable="x15 xr xr6 xr10 xr2"
 xmlns:ns3="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main" 
 xmlns:ns4="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" 
 xmlns:ns5="http://schemas.microsoft.com/office/spreadsheetml/2016/revision6" 
 xmlns:ns6="http://schemas.microsoft.com/office/spreadsheetml/2016/revision10"
 xmlns:ns7="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2">

As explained in this reference answer, SimpleXML's namespace handling is based around using the ->children() method to select the namespace you want to work with. The correct way to use this is to always specify the namespace URI you want, e.g. "http://schemas.openxmlformats.org/spreadsheetml/2006/main" or "http://schemas.microsoft.com/office/spreadsheetml/2016/revision10".

However, because the same program generally creates XML documents with the same choice of prefixes, it's easy to write incorrect code which relies on:

  • A particular namespace being the default, and therefore selected before you first call ->children()
  • Particular namespaces being bound to particular prefixes, and therefore selectable by looking up that prefix

The author of PhpSpreadsheet appears to have made both mistakes, meaning that when you try to load a document created by a different program, it doesn't find the namespaces it expects even though they're actually there.

IMSoP
  • 89,526
  • 13
  • 117
  • 169