What's wrong with iShares SP 500 ETF's Excel file?

Question

I am trying to parse iShares SP 500 ETF's Excel file, which looks like this:

<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<ss:Styles>
<ss:Style ss:ID="Default">
<ss:Alignment ss:Horizontal="Left"/>
</ss:Style>
...

It seems to be an old XLS Excel type file, but it is an XML file, yet xml.etree.ElementTree is complaining a lot.

I have tried:

import xml.etree.ElementTree as ET
tree = ET.parse(file_name)

and with encoding:

import xml.etree.ElementTree as ET
tree = ET.parse(file_name, parser=ET.XMLParser(recover=True))

import xml.etree.ElementTree as ET
tree = ET.parse(file_name, parser=ET.XMLParser(encoding='utf-8'))

error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 1

rb = xlrd.open_workbook(file_name, encoding_override='utf-8')
print(rb)

error:

xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf\xef\xbb\xbf<?'

but none seems to work at all... could anyone guide me in the right direction?

kjhughes · Accepted Answer · 2021-01-26T05:43:50.180

Your "XML" file is not well-formed:

It has two BOM characters before the XML declaration.

To remove (at least) one, see Convert UTF-8 with BOM to UTF-8 with no BOM in Python

Remove the & or change it to & on line 43.

<ss:Data ss:Type="String">iShares Core S&P 500 ETF</ss:Data>
                                        ^

Ditto for the & on line 117,188.

Without the above repairs, your file cannot be parsed as XML (because technically it isn't).

After the above repairs, you'll be able to parse it as XML, and Excel will be able to open it:

What's wrong with iShares SP 500 ETF's Excel file?

1 Answers1

See also