0

I'm trying to read a Excel file (.xlsx) and I'm getting the error "IndexError: list index out of range". My code is simple:

import pandas as pd
pd.read_excel(r'M:\PUBLIC\Felipe Dias\ONS\DIARIO_16-11-2021.xlsx')

The error:


  File "<ipython-input-16-fd0112985376>", line 2, in <module>
    pd.read_excel(r'M:\PUBLIC\Felipe Dias\ONS\DIARIO_16-11-2021.xlsx')

  File "C:\Users\Felipe.dias\Anaconda3\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)

  File "C:\Users\Felipe.dias\Anaconda3\lib\site-packages\pandas\io\excel\_base.py", line 364, in read_excel
    io = ExcelFile(io, storage_options=storage_options, engine=engine)

  File "C:\Users\Felipe.dias\Anaconda3\lib\site-packages\pandas\io\excel\_base.py", line 1233, in __init__
    self._reader = self._engines[engine](self._io, storage_options=storage_options)

  File "C:\Users\Felipe.dias\Anaconda3\lib\site-packages\pandas\io\excel\_openpyxl.py", line 522, in __init__
    super().__init__(filepath_or_buffer, storage_options=storage_options)

  File "C:\Users\Felipe.dias\Anaconda3\lib\site-packages\pandas\io\excel\_base.py", line 420, in __init__
    self.book = self.load_workbook(self.handles.handle)

  File "C:\Users\Felipe.dias\Anaconda3\lib\site-packages\pandas\io\excel\_openpyxl.py", line 533, in load_workbook
    return load_workbook(

  File "C:\Users\Felipe.dias\Anaconda3\lib\site-packages\openpyxl\reader\excel.py", line 317, in load_workbook
    reader.read()

  File "C:\Users\Felipe.dias\Anaconda3\lib\site-packages\openpyxl\reader\excel.py", line 281, in read
    apply_stylesheet(self.archive, self.wb)

  File "C:\Users\Felipe.dias\Anaconda3\lib\site-packages\openpyxl\styles\stylesheet.py", line 198, in apply_stylesheet
    stylesheet = Stylesheet.from_tree(node)

  File "C:\Users\Felipe.dias\Anaconda3\lib\site-packages\openpyxl\styles\stylesheet.py", line 103, in from_tree
    return super(Stylesheet, cls).from_tree(node)

  File "C:\Users\Felipe.dias\Anaconda3\lib\site-packages\openpyxl\descriptors\serialisable.py", line 103, in from_tree
    return cls(**attrib)

  File "C:\Users\Felipe.dias\Anaconda3\lib\site-packages\openpyxl\styles\stylesheet.py", line 94, in __init__
    self.named_styles = self._merge_named_styles()

  File "C:\Users\Felipe.dias\Anaconda3\lib\site-packages\openpyxl\styles\stylesheet.py", line 114, in _merge_named_styles
    self._expand_named_style(style)

  File "C:\Users\Felipe.dias\Anaconda3\lib\site-packages\openpyxl\styles\stylesheet.py", line 124, in _expand_named_style
    xf = self.cellStyleXfs[named_style.xfId]

  File "C:\Users\Felipe.dias\Anaconda3\lib\site-packages\openpyxl\styles\cell_style.py", line 185, in __getitem__
    return self.xf[idx]

IndexError: list index out of range

Maybe the Excel file is not proberly formatted, as stated here and here. But when I manually save the Excel file, the file is a "Excel Workbook (*.xlsx)" not a "Strict XML Open Spreadsheet", like in those other questions:

Print Screen: saving the Excel file manually

I downloaded this file from the web, so maybe the file is broken, but I don't know how to check it.

Thanks for your attention!

Edit 1:

Here is a print screen from the website's HTML

I don't know HTML, but I found strange that the file id is "xls-link" and its href is "./Html/DIARIO_16-11-2021.xlsx".

Like @Wayne said, when he downloaded the file, it came as .xls. After reading this answer, I tried running

pd.read_excel(r'M:\PUBLIC\Felipe Dias\ONS\DIARIO_16-11-2021.xls')

And got the error "[Errno 2] No such file or directory: 'M:\PUBLIC\Felipe Dias\ONS\DIARIO_16-11-2021.xls' "

Then I tried to open the file manually and save it as .xls. After running the code above, it actually worked!

But now my problem is: I will have to manually open and save as .xls all +5000 daily files that I need, which is a tedious quest. Does anyone know how I could do this automatically (without actually open it, because I still can't figure it out)?

femdias
  • 33
  • 1
  • 6
  • I take it there is a lot of code you are attempting on the one file? – stefan_aus_hannover Nov 17 '21 at 20:57
  • @stefan_aus_hannover I don't know what you mean, but I'm doing a web scraping, download a bunch of file and then (trying) to read them one buy one. But I guess it doesn't matter, because even with only these 2 lines of code I get the same error – femdias Nov 17 '21 at 21:01
  • your error shows at least two places that error could have come from. Its generic especially cause its not giving you a line that it happened on. – stefan_aus_hannover Nov 17 '21 at 21:03
  • @stefan_aus_hannover genau, it's a super generic error message – femdias Nov 17 '21 at 21:05
  • I opened in Excel and saved it as `.xls` and then `xls = pd.ExcelFile('DIARIO_16-11-2021.xls')` worked. I could see the sheet names listed using `xls.sheet_names`. Based on https://stackoverflow.com/a/61939376/8508004 and https://stackoverflow.com/questions/26521266/using-pandas-to-pd-read-excel-for-multiple-worksheets-of-the-same-workbook – Wayne Nov 17 '21 at 21:21
  • You can get it with `curl -L -o exhelL_file.xls https://github.com/fomightez/pdbepisa-binder/blob/23d8b81763b0cb6d78a4d4a23de6e36879fb2854/notebooks/tests/DIARIO_16-11-2021.xls?raw=true`. Add an excalmation point in front of that in a Jupyter notebook. It will be called `exhelL_file.xls` if you use that command. If you need a place that will work, use remote computer via https://mybinder.org/v2/gh/jupyterlab/jupyterlab-demo/HEAD?urlpath=lab/tree/demo , select 'terminal' once it launches, run command, and then save it onto your computer using file navigator pane on left. – Wayne Nov 17 '21 at 21:29
  • @Wayne do you have any ideia how to save the file as .xls through Python? Doing it manually is not optimal, because I need to do this with +5000 files – femdias Nov 18 '21 at 13:22
  • Sorry, I presently don't have any idea about that. I was sort of hoping this particular file was an outlier. Are you sure it isn't an outlier? The approaches I tried should normally work with Excel `.xlsx` files. Making me think it was a quirk. – Wayne Nov 18 '21 at 16:29
  • Okay, so [this post](https://github.com/pandas-dev/pandas/issues/39250#issuecomment-762380350) prompted at least some way forward although sadly the fix isn't as easy as in that case because I still don't see what about the worksheets is the issue. But I can see the data because if you rename the file `.xlsx` file you originally were getting to be `.zip` at the end instead of `.xlsx` you can unzip it and see the xml files with the data. – Wayne Nov 18 '21 at 18:12
  • Related to unpacking the xml, I was trying to read the xml files inside with https://stackoverflow.com/questions/33470130/read-excel-xml-xls-file-with-pandas but had no luck. However, I can see the data there in the XML files and so they'd be readable by Python in some manner. What specifically do you need to access in each file? You can probably just email me directly and we can work on parsing out what you'd need off Stackoverflow until we get somewhere. Email is posted in my Github profile and my stackoverflow profile has my GithHub user name. – Wayne Nov 18 '21 at 18:17

0 Answers0