Is it possible to get an Excel document's row count without loading the entire document into memory?

Question

I'm working on an application that processes huge Excel 2007 files, and I'm using OpenPyXL to do it. OpenPyXL has two different methods of reading an Excel file - one "normal" method where the entire document is loaded into memory at once, and one method where iterators are used to read row-by-row.

The problem is that when I'm using the iterator method, I don't get any document meta-data like column widths and row/column count, and i really need this data. I assume this data is stored in the Excel document close to the top, so it shouldn't be necessary to load the whole 10MB file into memory to get access to it.

So, is there a way to get ahold of the row/column count and column widths without loading the entire document into memory first?

I have a feeling that if you have huge Excel files, you're likely using Excel for a task it is unsuited to. — Markus, Nov 14 '12 at 12:24
@Markus: That's not really relevant. My boss is using Excel, I'm just writing this script for him. — Hubro, Nov 14 '12 at 12:48
In any case, I had a browse through openpyxl, and it doesn't seem to load the column dimensions for IterableWorksheet. If you load the whole thing at once you can get the dimensions like worksheet.column_dimensions["A"].width, however the column_dimensions dict is completely unpopulated for the iterable worksheet. :-/ It looks like the newer excel documents are just XML so you could in theory use that to look for your column elements and extract the info directly, but it's a hassle. — Markus, Nov 15 '12 at 11:38
@MadPhysicist 10MB is actually moderately big for xlsx files. Remember, they are compressed XML. So a 10MB xlsx could potentially unpack to >100mb when loaded (especially if it contains non-primitive objects). I have worked with XLSX in the 90MB range though... — cowbert, Dec 07 '18 at 18:44
@cowbert. I agree with "moderately big as far as XLS files go". For practical purposes, we are dealing with multiple GB of RAM though. — Mad Physicist, Dec 07 '18 at 18:52

score 145 · Accepted Answer · edited Oct 01 '15 at 15:06

145

Adding on to what Hubro said, apparently get_highest_row() has been deprecated. Using the max_row and max_column properties returns the row and column count. For example:

    wb = load_workbook(path, use_iterators=True)
    sheet = wb.worksheets[0]

    row_count = sheet.max_row
    column_count = sheet.max_column

edited Oct 01 '15 at 15:06

enigma

3,476
2
17
30

answered Sep 10 '15 at 22:52

dransom90

1,600
1
11
8

1

`max_row` and `max_column` didn't work for `sheet = wb.active`. I am using `openpyxl==2.4.8` – Hussain Apr 09 '18 at 10:15
@Hussain: What value did you get and what did you expect? And what about `sheet = wb.worksheets[0]`? – antiplex Aug 30 '18 at 08:29
3

@Hussain `sheet = wb.active` worked fine for me using that version – bigsee Dec 13 '18 at 16:06
1

but in this case, you're counting None valued cells as well, I tried to loop through the columns instead, I know it is not the best way. but it is useful for me. – Karam Qusai Dec 06 '19 at 08:43

score 19 · Answer 2 · edited Jun 20 '20 at 09:12

19

The solution suggested in this answer has been deprecated, and might no longer work.

Taking a look at the source code of OpenPyXL (IterableWorksheet) I've figured out how to get the column and row count from an iterator worksheet:

wb = load_workbook(path, use_iterators=True)
sheet = wb.worksheets[0]

row_count = sheet.get_highest_row() - 1
column_count = letter_to_index(sheet.get_highest_column()) + 1

IterableWorksheet.get_highest_column returns a string with the column letter that you can see in Excel, e.g. "A", "B", "C" etc. Therefore I've also written a function to translate the column letter to a zero based index:

def letter_to_index(letter):
    """Converts a column letter, e.g. "A", "B", "AA", "BC" etc. to a zero based
    column index.

    A becomes 0, B becomes 1, Z becomes 25, AA becomes 26 etc.

    Args:
        letter (str): The column index letter.
    Returns:
        The column index as an integer.
    """
    letter = letter.upper()
    result = 0

    for index, char in enumerate(reversed(letter)):
        # Get the ASCII number of the letter and subtract 64 so that A
        # corresponds to 1.
        num = ord(char) - 64

        # Multiply the number with 26 to the power of `index` to get the correct
        # value of the letter based on it's index in the string.
        final_num = (26 ** index) * num

        result += final_num

    # Subtract 1 from the result to make it zero-based before returning.
    return result - 1

I still haven't figured out how to get the column sizes though, so I've decided to use a fixed-width font and automatically scaled columns in my application.

edited Jun 20 '20 at 09:12

Community

1
1

answered Nov 14 '12 at 14:32

Hubro

56,214
69
228
381

In hindsight, subtracting 1 from `sheet.get_highest_row()` to get the row count was probably incorrect. Since the row numbers were 1-based, and not 0-based, the highest row index would also be the row count. There might have been a valid reason for subtracting 1 though, I can't recall. – Hubro Aug 12 '15 at 10:50
It's worth noting that the metadata about a worksheet's size is not always available. In which case you do have to go through the whole worksheet. – Charlie Clark Sep 11 '15 at 07:28
@CharlieClark When would it be unavailable? – Hubro Sep 11 '15 at 10:14
2

Well, for precisely the same reason as it's desirable to have this information early in the file when reading, it's impractical to put it there when streaming to a file. This is why it's considered an optional feature in the spec. So you won't find it in exports from Google or in openpyxl's own write-only mode. A better solution would be to include the metadata elsewhere in the package but we're stuck with specification as it is. openpyxl will let you know if the data is missing and let you calculate it (by reading in the whole worksheet). – Charlie Clark Sep 11 '15 at 12:56
@CharlieClark Thanks, I didn't know that. Luckily my application only has to support Excel files saved by the Excel application. – Hubro Sep 11 '15 at 16:19
Maybe there's a clever way to do it – the packaging format does sort of support streaming which something hierarchical like XML doesn't do well – but Microsoft is best placed to know. It looks like internally it still does a lot of stuff the "good, old way" while everyone else tries to work with the specification. – Charlie Clark Sep 11 '15 at 16:56
@Hubro get_highest_row() has been deprecated.and no longer works in openpyxl-2.4.1 This answer should be updated. – sabbahillel Dec 23 '16 at 18:21
@sabbahillel Thanks, I've changed the accepted answer – Hubro Dec 23 '16 at 20:47

score 8 · Answer 3 · answered Oct 15 '19 at 07:55

8

Python 3

import openpyxl as xl

wb = xl.load_workbook("Sample.xlsx", enumerate)

#the 2 lines under do the same. 
sheet = wb.get_sheet_by_name('sheet') 
sheet = wb.worksheets[0]

row_count = sheet.max_row
column_count = sheet.max_column

#this works fore me.

answered Oct 15 '19 at 07:55

Anders Stamnes

81
1
2

1

Why is there an `enumerate` as second argument of `load_workbook()`. Can't find that in the docu. – buhtz Dec 07 '22 at 14:15

score 3 · Answer 4 · answered Nov 15 '12 at 12:27

This might be extremely convoluted and I might be missing the obvious, but without OpenPyXL filling in the column_dimensions in Iterable Worksheets (see my comment above), the only way I can see of finding the column size without loading everything is to parse the xml directly:

from xml.etree.ElementTree import iterparse
from openpyxl import load_workbook
wb=load_workbook("/path/to/workbook.xlsx", use_iterators=True)
ws=wb.worksheets[0]
xml = ws._xml_source
xml.seek(0)

for _,x in iterparse(xml):

    name= x.tag.split("}")[-1]
    if name=="col":
        print "Column %(max)s: Width: %(width)s"%x.attrib # width = x.attrib["width"]

    if name=="cols":
        print "break before reading the rest of the file"
        break

Also, this is probably bad style for parsing xml, I haven't done that before. — Markus, Nov 15 '12 at 12:30

score 0 · Answer 5 · answered Jan 08 '16 at 03:41

0

https://pythonhosted.org/pyexcel/iapi/pyexcel.sheets.Sheet.html see : row_range() Utility function to get row range

if you use pyexcel, can call row_range get max rows.

python 3.4 test pass.

answered Jan 08 '16 at 03:41

delphisharp

111
3

python 3.4 test ok. – delphisharp Jan 08 '16 at 03:42

score 0 · Answer 6 · answered Jul 21 '21 at 14:57

Options using pandas.

Gets all sheetnames with count of rows and columns.

import pandas as pd
xl = pd.ExcelFile('file.xlsx')
sheetnames = xl.sheet_names
for sheet in sheetnames:
    df = xl.parse(sheet)
    dimensions = df.shape
    print('sheetname', ' --> ', dimensions)

Single sheet count of rows and columns.

import pandas as pd
xl = pd.ExcelFile('file.xlsx')
sheetnames = xl.sheet_names
df = xl.parse(sheetnames[0])   # [0] get first tab/sheet.
dimensions = df.shape
print(f'sheetname: "{sheetnames[0]}" - -> {dimensions}')

output sheetname "Sheet1" --> (row count, column count)

This was very slow for me on a large dataset. It doesn't seem to load the whole file into memory, but does seem to parse it fully? — Danferno, Oct 24 '22 at 09:42

Is it possible to get an Excel document's row count without loading the entire document into memory?

6 Answers6

The solution suggested in this answer has been deprecated, and might no longer work.

Linked