5

I'm on a Debian GNU/Linux computer, working with Python 2.7.9.

As a part of my job, I have been making python scripts that read inputs in various formats (e.g. Excel, Csv, Txt) and parse the information to more standarized files. It's not my first time opening or working with Excel files.

There's a particular file which is giving me problems, I just can't open it. When I tried with xlrd (version 0.9.3), it gave me the following error:

xlrd.open_workbook('sample.xls')

XLRDError: Unsupported format, or corrupt file: BOF not workbook/worksheet: op=0x0009 vers=0x0002 strm=0x000a build=0 year=0 -> BIFF21

I tried to investigate the matter on my own, found a couple of answers in StackOverflow but I couldn't open it anyway. This particular answer I found may be the problem (the second explanation), but it doesn't include a workaround: https://stackoverflow.com/a/16518707/4345659

A tool that could conert the file to csv/txt would also solve the problem.

I already tried with:

  • xlrd
  • openpyxl
  • xlsx2csv (the shell tool)

A sample file is available here: https://ufile.io/r4m6j

As a side note, I can open it with LibreOffice Calc and MS Excel, so I could eventually change it to csv that way. The thing is, I need to do it all with a python script.

Thanks in advance!

francisco sollima
  • 7,952
  • 4
  • 22
  • 38
  • Just to double-check - the file is opening in Excel without any issues? – GSazheniuk Oct 20 '17 at 14:18
  • As far as I can tell, yes, I can open it, the cells look normal, nothing looks off. – francisco sollima Oct 20 '17 at 14:34
  • This issue about your `MIME` or `EOF` bytes error. Supported formats not is `repaired formats`, which methods ignore errors/mistakes ? Your file got additional or missing/corrupted data(about of file_system) – dsgdfg Oct 20 '17 at 15:09
  • @franciscosollima where does that Excel file come from? If you are somehow getting it from Windows environment, there is chance that end-of-line characters are "translated" according to Unix standard. One thing you could check is to rename your .xlsx. file to .zip file and try to open it. You should be able to open it and extract the files without any issues. – GSazheniuk Oct 20 '17 at 15:33
  • I get an error when trying to open it after I changed the extension to .zip :( – francisco sollima Oct 20 '17 at 15:36
  • @franciscosollima is there any chance you could share the file? You could strip any sensible information from it. – GSazheniuk Oct 20 '17 at 15:47
  • The thing is, if I open it to remove the sensitive information and I save it, once saved I can open it. And I can't share the file as is. – francisco sollima Oct 20 '17 at 15:50
  • 1
    Nevermind, working on getting a sample! – francisco sollima Oct 20 '17 at 15:52
  • I have tried opening the sample and Excel is telling me it is corrupt. "The workbook cannot be opened or repaired by Microsoft Excel because it is corrupt." – nickromano Oct 23 '17 at 21:19
  • Maybe it’s related to your/mine version of MS Excel. But I definitely could open it, in spite of the fact the file is clearly faulty. – francisco sollima Oct 23 '17 at 23:18
  • It's strange, because `xls2csv` from the package catdoc can't parse the sample document neither with an error message: `./sample.xls is not OLE file or Error` – Shmygol Oct 27 '17 at 21:00
  • And I can open it in LibreOffice Calc – Shmygol Oct 27 '17 at 21:01
  • That’s what throws me off. – francisco sollima Oct 28 '17 at 00:26

9 Answers9

2

It seems like MS Problem. The xls file is very strange, maybe you should contact xlrd support.

But I have a crazy workaround for you: xls2ods. It works for me even though xls2csv doesn't (SiC!).

So, install catdoc first:

$sudo apt-get install catdoc

Then convert your xls file to ods and open ods using pyexcel_ods or whatever you prefer. To use pyexcel_ods install it first using pip install pyexcel_ods.

import subprocess

from pyexcel_ods import get_data

file_basename = 'sample'

returncode = subprocess.call(['xls2ods', '{}.xls'.format(file_basename)])
if returnecode > 0:
    # consider to use subprocess.Popen if you need more control on stderr
    exit(returncode)

data = get_data('{}.ods'.format(file_basename))
print(data)

I'm getting following output:

OrderedDict([(u'sample',
              [[u'labo',
                u'codfarm',
                u'farmacia',
                u'direccion',
                u'localidad',
                u'nom_medico',
                u'matricula',
                u'troquel',
                u'producto',
                u'cant_total']])])
Shmygol
  • 913
  • 7
  • 16
  • In my case, I had to install `pyexcel_ods` from pip because after successfully installing catdoc from apt it still didn't found the pyexcel_ods module. Nevertheless, it works! Thanks a bunch! – francisco sollima Oct 30 '17 at 12:37
  • Glad to help. Yeah, I didn't notice, that `pyexcel_ods` isn't a part of `catdoc`, because `pyexcel_ods` is just an example for parsing ODS Documents. – Shmygol Oct 30 '17 at 13:54
2

Here is a kludge I would use:

Assuming you have LibreOffice on Debian, you could either convert all your *.xls files into *.csv using:

import os 
os.system("libreoffice --headless --convert-to csv *.xls")
#or use os.call

... and then work consistently with csv.

Or you could convert only the corrupted file(s) when needed using a try/except block:

import os 

try:
    xlrd.open_workbook('sample.xls')
except XLRDError:
    os.system("libreoffice --headless --convert-to csv sample.xls")
    # mycsv = open("sample.csv", "r")
    # for line in mycsv.readlines():
    #     ...
    #     ...

OBS: Keep LibreOffice closed while running the script.

Alternatively there are other tools out there to do the conversion. Here is one (which I have not tested): https://github.com/dilshod/xlsx2csv

snake_charmer
  • 2,845
  • 4
  • 26
  • 39
  • When I run `libreoffice --headless --convert-to csv sample.xls` (either from os.system and from shell) no file gets created. No mistake either. It only ends silently and doesn't do anything... – francisco sollima Oct 30 '17 at 12:35
  • The same happened to me when I run the script with libreoffice open. Did you make sure it was closed when you tried it? – snake_charmer Oct 30 '17 at 15:22
  • You were right, it does work, my bad. I will upvote, but I will still accept @Trilliput's answer as it's more simple to apply in my particular situation! Thanks anyway! – francisco sollima Oct 30 '17 at 15:34
0

If you are targeting windows, if you have Excel installed, and if you are familiar with Excel VBA, you will have a quick solution using the comtypes package:

http://pythonhosted.org/comtypes/

You will have direct access to Excel by its COM interfaces.

  • How would I use the package? Can you add to the answer a quick script to actually open the file I provided in python? At least to save it as another format, such as .csv... – francisco sollima Oct 23 '17 at 18:48
  • I could not open the sample.xls with Excel 2013 (says it is corrupt). The repair failed. There is no reason it would work with any other tool. I will see if I can do more. –  Oct 23 '17 at 20:47
0

This code open an xls file and saves it as a cvs file, using the comtypes package:

import comtypes.client as cl
progId = "Excel.Application.15"
xl = cl.CreateObject(progId)
wb = xl.Workbooks.Open(r"C:\Users\aUser\Desktop\thermoList.xls")
wb.SaveAs(r"C:\Users\aUser\Desktop\thermoList.csv",FileFormat=6)
xl.DisplayAlerts = False
xl.Quit()

I could not test it with "sample.xls" which is corrupt. Your could try with another file. You might need to adjust the progId according to your version of Excel.

0

It's a file format issue. I'm not sure what file type is it but it's not Excel. I just open and saved the file with sample2.xls name and compare the types: enter image description here

How are you creating this file?

ChaosPredictor
  • 3,777
  • 1
  • 36
  • 46
0

If you need to get the words as a list of strings:

text_file = open("sample.xls", "r")
lines = text_file.read().replace(chr(200), '').replace(chr(0), '').replace(chr(1), '').replace(chr(5), '').replace(chr(2), '').replace(chr(3), '').replace(chr(4), '').replace(chr(6), '').replace(chr(7), '').replace(chr(8), '').replace(chr(9), '').replace(chr(10), '').replace(chr(12), '').replace(chr(15), '').replace(chr(16), '').replace(chr(17), '').replace(chr(18), '').replace(chr(49), '').replace('Arial', '')
for line in lines.split(chr(128)):
    print(line)

the output: enter image description here

ChaosPredictor
  • 3,777
  • 1
  • 36
  • 46
0

The file you provided is corrupted, so there is no way for other responders to test it and recommend a good solution. And exception you posted confirming that. As a solution you can try to debug some things, please see some steps below:

  • You mentioned you tried the xlrd library. Try to check if your xlrd module is upto date by executing this:

    Python 2.7.9

    >>> import xlrd
    
    >>> xlrd.__VERSION
    

update to the latest official version if needed

  • Try to open any other *.xls file and see if it works with Python version you're using and current library.

  • Check module documentation it's pretty good, and there are some different things described how to use this module on various platforms( Win vs. Linux)http://xlrd.readthedocs.io/en/latest/dates.html

  • You always can rich out to the community (there is still a chance that you might be getting into some weird state or bug) the link is here https://github.com/python-excel/xlrd/issues

Hope that helps.

yadayada
  • 327
  • 1
  • 3
  • 14
0

Unable to open your Excel either. Just as yadayada said, I think it is the problem of data source. If you really want to figure out the reason, I suggest you ask questions about the excel instead of python.

-2

It's always work for me with any xls or xlsx files:

def csv_from_excel(filename_xls, filename_csv):
    wb = xlrd.open_workbook(filename_xls, encoding_override='YOUR_ENCODING_HERE (f.e. "cp1251"')
    sh = wb.sheet_by_index(0)
    your_csv_file = open(filename_csv, 'wb')
    wr = unicodecsv.writer(your_csv_file)
    for rownum in xrange(sh.nrows):
        wr.writerow(sh.row_values(rownum))
    your_csv_file.close()

So, i don't work directly with excel file before convert them to csv. Mb it will help you

Vasily Bronsky
  • 435
  • 2
  • 12