5

I have a ton of Excel files that each have a column where numbers are formatted as text. Excel gives the error "the number in this cell is formatted as text or preceded by an apostrophe" - see the third column where the cells have a green triangle.

enter image description here

My goal is to open all of these files in Pandas without having to manually open each of them and convert the column to number. However, pd.read_excel() fails with the following xlrd error:

XLRDError: ZIP file contents not a known type of workbook

Unsurprisingly when I use xlrd directly: wb = xlrd.open_workbook(filename) I get the same error.

I also tried openpyxl: wb = openpyxl.load_workbook(filename), and it gives me this:

KeyError: "There is no item named 'xl/_rels/workbook.xml.rels' in the archive"

I confirmed that the file is openable by both pandas (xlrd) and openpyxl if I manually convert the column to number in excel and re-save the workbook.

Does anyone have any ideas?

Community
  • 1
  • 1
wkzhu
  • 1,616
  • 13
  • 23
  • Is there a chance you could upload a sample? I'm trying to produce the same error, but not quite getting it. – francisco sollima Jan 08 '18 at 19:52
  • you've basically got to use "converters" while reading the excel into a dataframe. Read doc (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html) question seems pretty similar to an earlier post. (https://stackoverflow.com/questions/32591466/python-pandas-how-to-specify-data-types-when-reading-an-excel-file). – murphy1310 Jan 08 '18 at 19:54
  • Worth noting that the file is invalid because it's missing the workbook relationships file. – Charlie Clark Jan 09 '18 at 07:46

1 Answers1

3

Use "converters" while reading excel.
Docs.

Eg.:

df = pd.read_excel('yourfile.xlsx',sheetname='sheetname',header=0,converters={ "% Chg" : str })
murphy1310
  • 647
  • 1
  • 6
  • 13