I am trying to open an xls
file (with only one tab) into a pandas dataframe.
It is a file that i can normally read in excel or excel for the web, in fact here is the raw file itself: https://www.dropbox.com/scl/fi/zbxg8ymjp8zxo6k4an4dj/product-screener.xls?dl=0&rlkey=3aw7whab78jeexbdkthkjzkmu .
I notice that the top two rows have merged cells and so do some of the columns.
I have tried several methods (from stack), which all fail.
# method 1 - read excel
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_excel(file)
print(df)
error: Excel file format cannot be determined, you must specify an engine manually.
# method 2 - pip install xlrd and use engine
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_excel(file, engine='xlrd')
print(df)
error: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf<?xml'
# method 3 - rename to xlsx and open with openpyxl
file = "C:\\Users\\admin\\Downloads\\product-screener.xlsx"
df = pd.read_excel(file, engine='openpyxl')
print(df)
error: File is not a zip file
(possibly converting, as opposed to renaming, is an option).
# method 4 - use read_xml
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_xml(file)
print(df)
this method actually yields a result, but produces a DataFrame that has no meaning in relation to the sheet. presumably one needs to interpret the xml (seems complex) ?
Style Name Table
0 NaN None NaN
1 NaN All funds NaN
# method 5 - use read_table
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_table(file)
print(df)
This method reads the file into a one column (series) DataFrame. So how could one use this info to create a standard 2d DataFrame in the same shape as the xls file ?
0 <Workbook xmlns="urn:schemas-microsoft-com:off...
1 <Styles>
2 <Style ss:ID="Default">
3 <Alignment Horizontal="Left"/>
4 </Style>
... ...
226532 </Cell>
226533 </Row>
226534 </Table>
226535 </Worksheet>
226536 </Workbook>
# method 5 - use read_html
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_html(file)
print(df)
this returns a blank list []
whereas one might have expected at least a list of DataFrames.
So the question is what is the easiest method to read this file into a dataframe (or similar usable format) ?