
I am using openpyxl to copy data from an Excel spreadsheet. The data is a table for an inventory database, where each row is an entry in the database. I read the table one row at a time using a for loop, and to determine the range of that loop I wrote a function that examines each cell in the table to find the table's height.

Code:

def find_max(self, sheet, row, column):
    max_row = 0
    # Keep scanning columns for as long as the column's header cell is non-empty
    cell_top = sheet.cell(row=row - 1, column=column)
    while cell_top.value is not None:
        current_row = row
        cell = sheet.cell(row=current_row, column=column)
        height = 0
        # Walk down the column; only stop when two consecutive cells are empty
        while (cell.value is not None
                or sheet.cell(row=current_row + 1, column=column).value is not None):
            current_row += 1
            height += 1
            cell = sheet.cell(row=current_row, column=column)
        if height > max_row:
            max_row = height
        # Move on to the next column's header cell
        column += 1
        cell_top = sheet.cell(row=row - 1, column=column)

    return max_row

To summarize the function: I move through the table one column at a time, iterating over every cell in that column and keeping track of its height, until there are no more columns. The catch is that the function has to find two empty cells in a row before the inner condition fails. In a previous version I used a similar approach, but checked only one column and stopped as soon as I found a blank cell. I had to change it so the program would still run if the user forgot to fill out a column. This function works fine for a small table, but on a table with several hundred entries it makes the program run much more slowly.

My question is this: what can I do to make this more efficient? I know that nesting a while loop like that makes the program take longer, but I do not see how to get around it. I have to make the program as foolproof as possible, so I need to check more than one column to keep user errors from breaking it.

Son of a Sailor

2 Answers


This is untested, but every time I've used openpyxl, I iterate over all rows like so:

for row in active_worksheet:
    do_something_to(row)

so you could count the rows like this:

count = 0
for row in active_worksheet:
    count += 1

EDIT: This is a better solution: Is it possible to get an Excel document's row count without loading the entire document into memory?
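
For reference, here is an untested sketch of the approach from that linked question: openpyxl's read-only mode reports the sheet dimensions without building the full in-memory model (the filename and sheet name here are placeholders):

from openpyxl import load_workbook

# read_only=True streams the file instead of loading every cell into memory
wb = load_workbook("inventory.xlsx", read_only=True)
ws = wb["Sheet1"]

# max_row and max_column come from the sheet's recorded dimensions,
# so no cell-by-cell scan is needed
print(ws.max_row, ws.max_column)

Note that `max_row` relies on the dimension information recorded in the file, so a sheet saved with stale dimension data may report more rows than actually hold values.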

Will
  • The linked solution worked for me. However, it turned out that another part of the program was causing the massive slowdown – Son of a Sailor Jun 03 '15 at 18:46
  • I recently found a neat tool for profiling Python scripts, https://docs.python.org/2/library/profile.html; maybe this can help you speed things up? (A minimal example follows these comments.) – Will Jun 03 '15 at 18:50
  • I just printed `time.clock()` at various points and found the problem area. I have a couple of nested for loops that iterate over every cell in the marked rows. There are about 10 cells in every row and a few hundred rows, so that's a few thousand iterations it has to make. If I wasn't grabbing particular rows I could use your method, but I'm not going to tear my hair out over this. – Son of a Sailor Jun 03 '15 at 19:12
  • In Python you rarely need to create your own counters: `for idx, row in enumerate(ws, 1):` – Charlie Clark Jun 04 '15 at 05:59
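
As a rough illustration of the profiling suggestion in the comments above, here is a minimal, untested way to use the standard-library profiler (`read_inventory` and the filename are hypothetical stand-ins for your own entry point):

import cProfile
import pstats

# Profile one full run and dump the statistics to a file
cProfile.run("read_inventory('inventory.xlsx')", "profile.out")

# Show the ten most expensive calls by cumulative time
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)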

Read-only mode works row-by-row on the source, so you probably want to hook into it. Alternatively, you could pass the cells of a worksheet into something like a Pandas DataFrame, which has indices even for empty cells.
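
A minimal, untested sketch of that second idea, assuming the first row of the sheet holds the column headers (the filename and sheet name are placeholders):

import pandas as pd
from openpyxl import load_workbook

wb = load_workbook("inventory.xlsx")
ws = wb["Sheet1"]

# ws.values yields one tuple of cell values per row;
# the first row is used as the header here
rows = ws.values
header = next(rows)
df = pd.DataFrame(rows, columns=header)

# Empty cells come through as None/NaN, so the table height
# is simply the number of rows in the frame
print(len(df.index))

From there, something like `df.dropna(how="all")` would discard fully empty rows if the sheet has trailing blanks.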

Charlie Clark