1

I am trying to parse iShares SP 500 ETF's Excel file, which looks like this:

<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<ss:Styles>
<ss:Style ss:ID="Default">
<ss:Alignment ss:Horizontal="Left"/>
</ss:Style>
...

It seems to be an old XLS Excel type file, but it is an XML file, yet xml.etree.ElementTree is complaining a lot.

I have tried:

import xml.etree.ElementTree as ET
tree = ET.parse(file_name)

and with encoding:

import xml.etree.ElementTree as ET
tree = ET.parse(file_name, parser=ET.XMLParser(recover=True))
import xml.etree.ElementTree as ET
tree = ET.parse(file_name, parser=ET.XMLParser(encoding='utf-8'))

error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 1
rb = xlrd.open_workbook(file_name, encoding_override='utf-8')
print(rb)

error:

xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf\xef\xbb\xbf<?'

but none seems to work at all... could anyone guide me in the right direction?

kjhughes
  • 106,133
  • 27
  • 181
  • 240
ted_chou12
  • 15
  • 5

1 Answers1

2

Your "XML" file is not well-formed:

  1. It has two BOM characters before the XML declaration.

    To remove (at least) one, see Convert UTF-8 with BOM to UTF-8 with no BOM in Python

  2. Remove the & or change it to &amp; on line 43.

    <ss:Data ss:Type="String">iShares Core S&P 500 ETF</ss:Data>
                                            ^
    
  3. Ditto for the & on line 117,188.

Without the above repairs, your file cannot be parsed as XML (because technically it isn't).

After the above repairs, you'll be able to parse it as XML, and Excel will be able to open it:

enter image description here

See also

kjhughes
  • 106,133
  • 27
  • 181
  • 240
  • 1
    Thank you so much! Points are precisely on the spot and fixed the file entirely! Thank you so much again! – ted_chou12 Jan 27 '21 at 11:34