19

I'm having problems reading .xls files written by a Perl script which I have no control over. The files contain some formatting and line breaks within cells.

filename = '/home/shared/testfile.xls'
book = xlrd.open_workbook(filename)
sheet = book.sheet_by_index(0)
for rowIndex in xrange(1, sheet.nrows):
    row = sheet.row(rowIndex)

This is throwing the following error:

_locate_stream(Workbook): seen
    0  5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
   20  4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
172480= 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
172500  4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 2
172520  2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
173840= 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
173860  2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1
173880  1 1 1 1 1 1 1 1
Traceback (most recent call last):
  File "/home/shared/xlrdtest.py", line 5, in <module>
    book = xlrd.open_workbook(filename)
  File "/usr/local/lib/python2.7/site-packages/xlrd/__init__.py", line 443, in open_workbook
    ragged_rows=ragged_rows,
  File "/usr/local/lib/python2.7/site-packages/xlrd/book.py", line 84, in open_workbook_xls
    ragged_rows=ragged_rows,
  File "/usr/local/lib/python2.7/site-packages/xlrd/book.py", line 616, in biff2_8_load
    self.mem, self.base, self.stream_len = cd.locate_named_stream(qname)
  File "/usr/local/lib/python2.7/site-packages/xlrd/compdoc.py", line 393, in locate_named_stream
    d.tot_size, qname, d.DID+6)
  File "/usr/local/lib/python2.7/site-packages/xlrd/compdoc.py", line 421, in _locate_stream
    raise CompDocError("%s corruption: seen[%d] == %d" % (qname, s, self.seen[s]))
xlrd.compdoc.CompDocError: Workbook corruption: seen[2] == 4

I'm not able to find any info about CompDocError or Workbook corruption, even less the seen[2] == 4 part.

snurre
  • 3,045
  • 2
  • 24
  • 31
  • what does 'filename' contains? – juankysmith Oct 03 '12 at 09:31
  • It's just a file path, like 'D:\\testfile.xls' or '/home/shared/testfile.xls' (Updated question) – snurre Oct 03 '12 at 09:35
  • @snurre, try to 'attach' a problematic xls file to this post, so the error would be reproducable – bpgergo Oct 03 '12 at 10:04
  • I'm afraid I can't do that, as they contain confidential information. It should also be noted that I'm able to load the xls files with xlrd if they have been opened and saved with Excel though, so I'm suspecting this is all related to how the Perl script writes and saves the files. – snurre Oct 03 '12 at 11:54
  • 2
    I'm the author of xlrd. What version of xlrd are you running? If you can't send me a copy of your perl-written file under a non-disclosure agreement, the next best thing is if you tell me your email address so that I can send you an investigation script that will display the structure (not the contents) of your file. – John Machin Jan 01 '13 at 00:38
  • @JohnMachin I have run into same problem could you provide me investigation script. – sharafjaffri Jul 26 '13 at 08:03
  • @snurre Did you found any solution? am using latest xrld but same error. – sharafjaffri Jul 26 '13 at 08:04
  • No, sorry. I gave up and eventually got the script providing the data changed to output JSON instead. – snurre Jul 30 '13 at 13:40
  • this worked for me http://www.crimulus.com/2013/09/19/reading-compound-file-binary-format-files-generated-by-phpexcel-with-pythondjango-xlrd/ – pkm Apr 12 '17 at 16:45
  • One example file: https://www.cepea.esalq.usp.br/br/indicador/series/milho.aspx?id=77 – Julio Batista Silva Aug 10 '20 at 19:29

7 Answers7

13

From pkm comment (link) the problem is with a Compound File Binary

#pip install OleFileIO-PL
import OleFileIO_PL
import pandas as pd

path = 'file.xls'
with open(path,'rb') as file:
    ole = OleFileIO_PL.OleFileIO(file)
    if ole.exists('Workbook'):
        d = ole.openstream('Workbook')
        x=pd.read_excel(d,engine='xlrd')
        print(x.head())
holymoly345
  • 131
  • 1
  • 3
10

+1 to Ramiel. Just comment out these lines in compdoc.py (lines 425-27 in xlrd 1.2.0):

if self.seen[s]:
    print("_locate_stream(%s): seen" % qname, file=self.logfile);dump_list(self.seen, 20, self.logfile)
    raise CompDocError("%s corruption: seen[%d] == %d" % (qname, s, self.seen[s]))
Community
  • 1
  • 1
Shatalov Vadim
  • 109
  • 1
  • 3
  • I know this is an old question, however I came across the same error and comment out those lines and those mentioned [here](https://stackoverflow.com/questions/34550624/python-xlrd-read-excel-file-error/34550735) but didn't work. Someone has find any solution yet? – Gustavo Rottgering May 17 '20 at 22:23
  • @GustavoRottgering: See [@holymoly345's answer](https://stackoverflow.com/a/60416081/1424087). – tricasse Dec 14 '20 at 23:48
9

Inspired by this answer, but a bit more straigthforward:

pd.read_excel accepts xlrd.Book objects so you could do:

workbook = xlrd.open_workbook('file_name.xls', ignore_workbook_corruption=True)
excel = pd.read_excel(workbook)
Adrian W
  • 4,563
  • 11
  • 38
  • 52
Alfonso
  • 161
  • 1
  • 4
7

[For pandas users who have trouble with Workbook corruption: seen[2] == 4] when reading XLS files

NOTE: xlrd no longer supports anything other than xls files, see this answer for details and alternatives.

  1. Use the xlrd 2.0.0 or above: https://pypi.org/project/xlrd/.

  2. Add your own XLS engine for pandas in one of your __init__.py files which loads automatically

from pandas.io.excel._xlrd import XlrdReader

class CustomXlrdReader(XlrdReader):

    def load_workbook(self, filepath_or_buffer):
        """Same as original, just uses ignore_workbook_corruption=True)"""
        from xlrd import open_workbook

        if hasattr(filepath_or_buffer, "read"):
            data = filepath_or_buffer.read()
            return open_workbook(file_contents=data, ignore_workbook_corruption=True)
        else:
            return open_workbook(filepath_or_buffer)


ExcelFile._engines['custom_xlrd'] = CustomXlrdReader
print('Monkey patching pandas XLS engines. See CustomXlrdReader')

To use it, specify engine='custom_xlrd' when you read XLS file

df = pd.read_excel(filepath, engine='custom_xlrd')

Here's the corrupted file that xlrd is tested against:

https://github.com/python-excel/xlrd/blob/master/tests/samples/corrupted_error.xls

pymen
  • 5,737
  • 44
  • 35
2

I got same error with one of my .xls files (excel can open them just fine). Problem is located in xlrd compdoc.py. As I guess Compdoc.seen array keeps track of already read "FAT" sectors. In my case Root Entry reading block (SSCS) gets all that sectors marked as seen, leading to exception raise in future. U can try to find the bug in sectors reading logic and contribute to xlrd :) or just comment this lines with exception raise which will likely solve problem in your case (As did in mine) and wait for xlrd update.

glmvrml
  • 1,612
  • 2
  • 14
  • 31
0

Maybe as a last resort you can try and save the .xls file as a .csv file and then try and read it.

Obviously you say you can open it after an open and close from Excel so it's the same effort.

If you realy want your script to open them, then if you are on the windows platform use the pywin32 to open and close Excel from your script, and open and close the file in the same go. Maybe that could work. (silly but could be a work around)

Example (stolen from here)

from win32com.client import Dispatch
xl = Dispatch('Excel.Application')
wb = xl.Workbooks.Open('C:\\Documents and Settings\\GradeBook.xls')
Community
  • 1
  • 1
Tooblippe
  • 3,433
  • 3
  • 17
  • 25
-1

testfile.xls -----> Save As ---> Format 97-2003 --->testfile2.xlc

filename = '/home/shared/testfile2.xls'

OK