Python: Unsupported format, or corrupt file

Question

I am trying to make a python program that downloads and XLS file from a website, in this case website is: https://www.blackrock.com/uk/individual/products/291392/, and loads it as a dataframe in pandas, with the correct data structure.

The issue is that when I try to load it via pandas, it gives me an error: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf\xef\xbb\xbf<?'

I am not quite sure what is causing this error, but presumable something with the file. I can open the file in Excel, even though I get a warning that the file and the file extension do not match, and that the file might be dangerous etc. If I click yes to opening it anyway, it opens up with data displayed correctly. If I use Excel to save the file as .xlsx i can open it in pandas, but I would rather a solution that didn't require manually opening Excel and saving the file.

I have tried renaming the file extension to xlsx, but this does not work, as it won't allow me to open the file with that extension. I have tried many different extension, but non of them bite - unfortunately.

I am at a loss.

I hope, you can help.

EDIT: The code I use is:

download_path = 'https://www.blackrock.com/uk/individual/products/291392/fund/1527484370694.ajax?fileType=xls&fileName=iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund&dataType=fund'

testing = pd.read_excel(download_path, engine='xlrd', sheet_name = 'Holdings', skiprows = 3)

maybe first download it using `requests` and save in file and open in any text editor to see what you get from server. Maybe you get HTML with warning for spamers/hackers/bots. Or maybe it send HTML with form for login/password. OR maybe server uses compression (ie. zip) and it may need to decompress it. OR maybe server check what web browser was used and send correct data only for real web browsers. — furas, Nov 18 '21 at 23:20

score 2 · Answer 1 · answered Nov 19 '21 at 00:17

The actual problem is that the file format is SpreadSheetML which has only been used briefly between 2003 and 2006. It has been overtaken by the XLSX format. Since, it has been around for a short time and while ago, most packages do not support for load/save operations. More about the format can be found here: https://learn.microsoft.com/en-us/previous-versions/office/developer/office-xp/aa140066(v=office.10)?redirectedfrom=MSDN

For this reason, the Pandas or any other XML parser (e.g Etree) will not be able to load properly. The regular MS Office software would still load it correctly. As far as I know, you can deal with SpreadSheetML files using aspose-cells package: https://products.aspose.com/cells/python-java/

For your case:

# Import packages

import jpype
import asposecells
jpype.startJVM()
from asposecells.api import Workbook, FileFormatType
from asposecells.api import HtmlSaveOptions

# Read Workbook

workbook = Workbook('iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund.xls')
worksheet = workbook.getWorksheets().get(0)

# Accessing a cell using its name

cells = worksheet.getCells()
cell = cells.get("A1")

# Print Message

print("Cell Value: " + str(cell.getValue())) # Prints Cell Value: 17-Nov-2021

# To save SpreadSheetML in different format (HTML)

saveOptions = HtmlSaveOptions()
saveOptions.setDisableDownlevelRevealedComments(True)
workbook.save("iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund.html", saveOptions)

You are absolutely right! That is probably why I couldn't rename it to a non-error-throwing-extension. This is presumably a good solution, but I unfortunately don't have access to the apose product, but thank you for suggesting it! — Christoffer Madsen, Nov 19 '21 at 20:00

score 1 · Accepted Answer · answered Nov 19 '21 at 01:26

As mentioned by Slybot, this is not a real xls file.

If you inspect the contents in a plain text editor, or a hex editor, the header starts:

<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">

which confirms this is an xml document, and not an Office 2007 zipped xlsx office document.

Your next steps depend on whether you have Excel installed on the machine that will be running this code or not, and if not, what other libraries you have access to and are willing to pay for - Slybot has mentioned aspose for example.

The easiest solution - Excel

If you are running this on a Windows machine with Excel installed, you have the free and capable option of automating the operation of opening Excel and saving as xlsx. This is by using Win32com module, described in this answer:

Attempting to Parse an XLS (XML) File Using Python

Alternatively, save your Excel styled XML as xlsx with Workbook.SaveAs method using win32com (only for Windows users) and read in with pandas.read_excel skipping appropriate rows.

The XML solution

You could read in the raw XML and digest it. The relevant nodes are:

<ss:Workbook>  
<ss:Worksheet ss:Name="Holdings">
<ss:Table>
<ss:Row>
<ss:Cell ss:StyleID="Left">
<ss:Data ss:Type="String">iShares MSCI World SRI UCITS ETF</ss:Data>

The Third-party library solution

I am not familiar with any libraries which provide this functionality, and can't advise on this option.

Totally agree, the easiest would be that you use Excel if available to read convert to XLSX and proceed with regular pipeline. If not available, I would still suggest aspose where you can do the save transformation but with more labor. — Slybot, Nov 19 '21 at 07:31
I ended up going this route, automating the saving part in a function. I had to make another function that saved the file to specific directory first, but this is still a good solution. Thank you. — Christoffer Madsen, Nov 19 '21 at 19:58
@ChristofferMadsen If you are not familiar with the [tempfile](https://docs.python.org/3/library/tempfile.html) module, you should check it out. It is designed for this scenario. — Alan, Nov 20 '21 at 00:35

Python: Unsupported format, or corrupt file

2 Answers2

The easiest solution - Excel

The XML solution

The Third-party library solution