3

I'm trying to read an xlsx file into python using pandas.
I've done this thousands of times before but some reason it is not working with a particular file.

The file is downloaded from another source and I get an AssertionError (see end) when reading with pandas:

df = pandas.read_excel(pathtomyfile, sheetname = "Sheet1")

The variable is defined for the path. The path exists (os.path.exists(path) returns True).

When I copy the contents of the file and paste the values in a new excel doc, this new one will open with the read_excel() method.

When I copy the contents of the file and paste the formatting in a new excel, this new one will open with the read_excel() method.

It doesn't seem to be the values or the formatting.

I am guessing this could be an encoding issue?
Thank you for any help.

    df1 = pandas.read_excel(snap1)
File "C:\Python\python-3.4.4.amd64\lib\site-packages\pandas\io\excel.py", line 163, in read_excel
    io = ExcelFile(io, engine=engine)
File "C:\Python\python-3.4.4.amd64\lib\site-packages\pandas\io\excel.py", line 206, in __init__
    self.book = xlrd.open_workbook(io)
File "C:\Python\python-3.4.4.amd64\lib\site-packages\xlrd\__init__.py", line 422, in open_workbook
    ragged_rows=ragged_rows,
File "C:\Python\python-3.4.4.amd64\lib\site-packages\xlrd\xlsx.py", line 794, in open_workbook_2007_xml
    x12sheet.process_stream(zflo, heading)
File "C:\Python\python-3.4.4.amd64\lib\site-packages\xlrd\xlsx.py", line 531, in own_process_stream
    self_do_row(elem)
File "C:\Python\python-3.4.4.amd64\lib\site-packages\xlrd\xlsx.py", line 597, in do_row
    assert 0 <= self.rowx < X12_MAX_ROWS
AssertionError
Eoin
  • 357
  • 1
  • 4
  • 20
  • based on the assertion check being `0 <= self.rowx < X12_MAX_ROWS` I'm guessing that the `rowx` is either negative (Idk why that would happen) or more then what ever `X12_MAX_ROWS` is, is your spreadsheet remarkably large? – Tadhg McDonald-Jensen May 27 '16 at 14:57
  • 1
    no only 82 rows long, three columns – Eoin May 27 '16 at 14:58
  • then is it possible that there is a stray value way out of bounds in the file? what if you backup the content, select all and delete then paste back in the content? – Tadhg McDonald-Jensen May 27 '16 at 15:01
  • I selected all when I copied to the other excel file.....when I try the read_table method instead it says something about a 0x89 found? Is this possibly a source of error? – Eoin May 27 '16 at 15:04
  • THe error is as follows: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte – Eoin May 27 '16 at 15:05
  • 1
    I had a similar error, and the xlrd-people seem to be [aware](https://github.com/python-excel/xlrd/issues/56). For me it had to do with the `row=` set to 0 in one of the underlying xml-files. I could solve it by unpacking the xlsx, correcting the value to 1 and rezipping the files. Now I'm looking for a more automatic solution – Maarten Fabré Nov 30 '16 at 11:30
  • Since it is was an encoding error I faced, I used the `read_table()` method in tandem with the `encoding` parameter and "utf-16". This solved the issue as the text in one column was in a foreign language (Korean) and so the characters needed 16bit encoding rather than 8 – Eoin Nov 30 '16 at 14:32
  • @TadhgMcDonald-Jensen What if my spreadsheet does have many rows? >1.1M. I am also facing the same error of assertion. I have no problems reading this file into R. – Jantje Houten May 22 '21 at 19:03

6 Answers6

3

Look at your system for the file xlsx.py.

In your computer it's apparently at C:\Python\python-3.4.4.amd64\lib\site-packages\xlrd\xlsx.py

Search the line :

X12_MAX_ROWS = 2 ** 20

and change it so something like

X12_MAX_ROWS = 2 ** 22

This will push the limit of the number of lines from 1 million lines to 4 million lines.

jmarcio
  • 86
  • 3
2

In my case, I was using xlrd package to read excel and I got the same error of Assertion. Open your xlrd package from site-packages, and from that open sheet.py (https://github.com/python-excel/xlrd/blob/master/xlrd/sheet.py)

Find this code in sheet.py

    if self.biff_version >= 80:
        self.utter_max_rows = 65536
    else:
        self.utter_max_rows = 16384

Convert the above one into...

 #if self.biff_version >= 80:
 self.utter_max_rows = 65536
 #else:
 #      self.utter_max_rows = 16384

Now try to run your program... Problem will be solved..:)

0

Just for completeness, I had a similar problem where the row number of the first row was incorrect, I fixed my problem by changing xlsx-file with code adapted from this answer

def repair_broken_excelfile(zipfname, *filenames, new_name=None):
    # https://stackoverflow.com/a/4653863/1562285
    import tempfile
    import zipfile
    import shutil
    import os
    tempdir = tempfile.mkdtemp()
    try:
        tempname = os.path.join(tempdir, 'new.zip')
        with zipfile.ZipFile(zipfname, 'r') as zipread:
            with zipfile.ZipFile(tempname, 'w') as zipwrite:
                for item in zipread.infolist():
                    print('fn: ' + item.filename)
                    if item.filename not in filenames:
                        data = zipread.read(item.filename)

                        zipwrite.writestr(item, data)
                    else:

                        data = zipread.read(item.filename)
                        data = data.replace(b'<row r="0" spans="">', b'<row r="1" spans="">')
                        zipwrite.writestr(item, data)
                        pass
        if not new_name:
            new_name = zipfname
        shutil.move(tempname, new_name)
    finally:
        shutil.rmtree(tempdir)

Apparently there is a fix underway in xlrd

Community
  • 1
  • 1
Maarten Fabré
  • 6,938
  • 1
  • 17
  • 36
0

Have encountered the same problem, I save file under xml format: "Save as type: XML Spreadsheet 2003" on window. Then I open the file and save as xlsx format. The new file no longer gives error message.

ibn
  • 1
  • 1
0

The file contained Korean characters in the text. These needed alternative encoding. Using the "encoding" parameter in the read_excel() method resolved the issue.

df = pandas.read_excel(pathtomyfile, sheetname = "Sheet1", encoding="utf-16")

Eoin
  • 357
  • 1
  • 4
  • 20
  • there is no parameter encoding for read_excel method. why is this even selected as an answer – Abdulsalam Almahdi Jan 12 '23 at 14:09
  • You may be looking at the current pandas version which doesn't support it. – Eoin Jan 14 '23 at 08:33
  • Go back in time to the version I likely used (this should have been included on the original question) and the method accepts and passes key word arguments to the nested method. https://github.com/pandas-dev/pandas/blob/0.19.x/pandas/io/excel.py – Eoin Jan 14 '23 at 08:35
0

Sometimes this can be resolved just by deleting the (blank) lines below your table in Excel.