1

I am trying to open an xls file (with only one tab) into a pandas dataframe.

It is a file that i can normally read in excel or excel for the web, in fact here is the raw file itself: https://www.dropbox.com/scl/fi/zbxg8ymjp8zxo6k4an4dj/product-screener.xls?dl=0&rlkey=3aw7whab78jeexbdkthkjzkmu .

I notice that the top two rows have merged cells and so do some of the columns.

I have tried several methods (from stack), which all fail.

# method 1 - read excel
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_excel(file)
print(df)

error: Excel file format cannot be determined, you must specify an engine manually.

# method 2 - pip install xlrd and use engine
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_excel(file, engine='xlrd')
print(df)

error: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf<?xml'

# method 3 - rename to xlsx and open with openpyxl
file = "C:\\Users\\admin\\Downloads\\product-screener.xlsx"
df = pd.read_excel(file, engine='openpyxl')
print(df)

error: File is not a zip file (possibly converting, as opposed to renaming, is an option).

# method 4 - use read_xml
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_xml(file)
print(df)

this method actually yields a result, but produces a DataFrame that has no meaning in relation to the sheet. presumably one needs to interpret the xml (seems complex) ?

   Style       Name  Table
0    NaN       None    NaN
1    NaN  All funds    NaN


# method 5 - use read_table
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_table(file)
print(df)

This method reads the file into a one column (series) DataFrame. So how could one use this info to create a standard 2d DataFrame in the same shape as the xls file ?

0       <Workbook xmlns="urn:schemas-microsoft-com:off...
1                                                <Styles>
2                                 <Style ss:ID="Default">
3                          <Alignment Horizontal="Left"/>
4                                                </Style>
...                                                   ...
226532                                            </Cell>
226533                                             </Row>
226534                                           </Table>
226535                                       </Worksheet>
226536                                        </Workbook>



# method 5 - use read_html
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_html(file)
print(df)

this returns a blank list [] whereas one might have expected at least a list of DataFrames.

So the question is what is the easiest method to read this file into a dataframe (or similar usable format) ?

D.L
  • 4,339
  • 5
  • 22
  • 45
  • link to file (on google docs) here too: https://docs.google.com/spreadsheets/d/1YPaFg-cLJCpJb5EhEGtV9TcDSgQaKxVo/edit?usp=sharing&ouid=106649036151775603005&rtpof=true&sd=true – D.L Feb 02 '22 at 11:10
  • 1
    You `"xls"` file is actually an `xml Spreadsheet` file (just view it with e.g. notepad). Pandas/openpyxl can't read this kind of spreadsheet file. You can try [this solution](https://stackoverflow.com/a/33504236/3944322) as a workaround (I didn't test it though). – Stef Feb 02 '22 at 11:56
  • this appears to be a closer solution (hence the upvote), but not complete yet. A small portion (from rows 800 to 870), **but not all** of the data is now parsed into the DataFrame. The headers are missing too. So the question is how to capture all the data with the method ? – D.L Feb 02 '22 at 13:39

2 Answers2

1

Not a complete solution but it should get you started. The "xls" file is actually a plain xml file in the SpreadsheetML format. Change the file extension to .xml an view it in your internet browser, the structure (at least of the give file) is rather straightforward.

The following reads the data contents into a pandas DataFrame:

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('product-screener.xls')
root = tree.getroot()

data = [[c[0].text for c in r] for r in root[1][0][2:]]
types = [c[0].get('{urn:schemas-microsoft-com:office:spreadsheet}Type') for c in root[1][0][2]]

df = pd.DataFrame(data)
df = df.replace('-', None)
for c in df.columns:
    if types[c] == 'Number':
        df[c] = pd.to_numeric(df[c])
    elif types[c] == 'DateTime':
        df[c] = pd.to_datetime(df[c])

Getting the column names from rows 0 and 1 is a bit more involved due to the merged cells - I leave it as an exercise for the reader .

Stef
  • 28,728
  • 2
  • 24
  • 52
  • This method works (aside form the headers). Presumably one needs to include the rename (or a copy) of `xls` file to `xml` file and then delete the `xml` once finished with. I will try to include this in a more complete method. I have voted this up as it is, so far, the best answer. – D.L Feb 02 '22 at 16:53
  • I added `shutil.copyfile(file_xls, file_xml)` to transform xls to xml. This completes the process aside from the headers. A final edit of the header rows (1 & 2) would complete. – D.L Feb 02 '22 at 17:03
0

I am posting the full solution here which contains the above approved solution (by @Stef) plus the final addition of the headers into the DataFrame.

'''
get xls file
convert to xml
parse into dataframe
add headers

'''


import pandas as pd
import xml.etree.ElementTree as ET

import shutil

file_xls = "C:\\Users\\admin\\Downloads\\product-screener.xls"
file_xml = 'C:\\Users\\admin\\Downloads\\product-screener.xml'

shutil.copyfile(file_xls, file_xml)

tree = ET.parse(file_xml)
root = tree.getroot()

data = [[c[0].text for c in r] for r in root[1][0][2:]]
types = [c[0].get('{urn:schemas-microsoft-com:office:spreadsheet}Type') for c in root[1][0][2]]

df = pd.DataFrame(data)
df = df.replace('-', None)
for c in df.columns:
    if types[c] == 'Number':
        df[c] = pd.to_numeric(df[c])
    elif types[c] == 'DateTime':
        df[c] = pd.to_datetime(df[c])

print(df)


headers = [[c[0].text for c in r] for r in root[1][0][:2]]
# print(headers[0])
# print(len(headers[0]))
# print()
# print(headers[1])
# print(len(headers[1]))
# print()

# upto column (AF) comes from headers[0]
df_headers = headers[0][0:32]
# the next 9 are discrete
x_list = ['discrete: ' + s for s in headers[1][0:9]  ]
df_headers = df_headers + x_list

# the next 10 are annualised
x_list = ['annualised: ' + s for s in headers[1][9:19]  ]
df_headers = df_headers + x_list

# the next 10 are cumulative
x_list = ['cumulative: ' + s for s in headers[1][19:29]  ]
df_headers = df_headers + x_list

# the next 9 are calendar
x_list = ['calendar: ' + s for s in headers[1][29:38]  ]
df_headers = df_headers + x_list

# the next 5 are portfolio characteristics (metrics)
x_list = ['metrics: ' + s for s in headers[1][38:43]  ]
df_headers = df_headers + x_list

# the next 6 are portfolio characteristics
x_list = ['characteristics: ' + s for s in headers[1][43:49]  ]
df_headers = df_headers + x_list

# the final 5 are sustainability characteristics
x_list = ['sustain: ' + s for s in headers[1][49:54]  ]
df_headers = df_headers + x_list
print(df_headers)

# add headers to dataframe
df.columns = df_headers
print(df)
D.L
  • 4,339
  • 5
  • 22
  • 45