Python Find highest row in a given column

Question

I'm quite new to Stack Overflow and only recently learnt some basic Python. This is the first time I'm using openpyxl. Before this I used xlrd and xlsxwriter, and I managed to make some useful programs with them. But right now I need a single library that both reads and writes .xlsx files.

I have a file that I need to read and then edit using data already stored in the code. Let's suppose the .xlsx has five columns with data: A, B, C, D, E. In column A, I have over 1000 rows with data. In column D, I have 150 rows with data.

Basically, I want the program to find the last row with data on a given column (say D). Then, write the stored variable data in the next available row (last row + 1) in column D.

The problem is that I can't use ws.get_highest_row() because it returns the row 1000 on column A.

Basically, so far this is all I've got:

data = 'xxx'
from openpyxl import load_workbook
wb = load_workbook('book.xlsx', use_iterators=True)
ws = wb.get_sheet_by_name('Sheet1')
last_row = ws.get_highest_row()

Obviously this doesn't work at all. last_row returns 1000.

Do you have to use `openpyxl` or could you use another library? This looks like it might be a job for `pandas`, Python's data processing library. — LondonRob, Jul 03 '15 at 18:44
Anything would be perfect, but the file I'm working with is an xlsx. I read that openpyxl is the only library that both reads and writes them. – egodial, Jul 03 '15 at 18:48

alecxe · Answer 1 · 2015-07-03T19:05:05.700

2

The problem is that get_highest_row() itself uses RowDimension instances to determine the maximum row in the sheet. A RowDimension has no information about columns, which means we cannot use it to solve your problem and have to approach it differently.

Here is one somewhat "ugly", openpyxl-specific option. Note that it will not work if the workbook is loaded with use_iterators=True:

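from openpyxl.utils import coordinate_from_string

def get_maximum_row(ws, column):
    # ws._cells is private; its keys are assumed to be coordinate
    # strings such as "D150". Compare the parsed column letter exactly
    # so that "A" does not also match "AA1".
    return max(row for col, row in map(coordinate_from_string, ws._cells)
               if col == column)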

Usage:

print(get_maximum_row(ws, "A"))
print(get_maximum_row(ws, "B"))
print(get_maximum_row(ws, "C"))
print(get_maximum_row(ws, "D"))

Aside from this, I would follow @LondonRob's suggestion to parse the contents with pandas and let it do the job.
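
A sketch of an alternative that sticks to public cell attributes: scan the cells of one column and keep the highest row whose value is not None. Cells that exist but hold no value may be counted by a key-based scan, so checking the value sidesteps that. It assumes each cell exposes `.row` and `.value`, as openpyxl cells do. `last_row_with_value` and the stand-in `Cell` tuple are names made up for this illustration; with a real worksheet the cells would come from something like `ws['D']`.

```python
from collections import namedtuple

# Stand-in for an openpyxl cell; only .row and .value are used (illustrative).
Cell = namedtuple("Cell", ["row", "value"])


def last_row_with_value(cells):
    """Return the highest row number whose cell holds a value, or 0 if none."""
    last = 0
    for cell in cells:
        if cell.value is not None and cell.row > last:
            last = cell.row
    return last


# Column D here has values in rows 1-3; rows 4-5 exist but are empty.
column_d = [Cell(1, "a"), Cell(2, "b"), Cell(3, "c"), Cell(4, None), Cell(5, None)]
next_free_row = last_row_with_value(column_d) + 1
print(next_free_row)
```

Returning 0 for a column with no values means "last row + 1" naturally points at row 1.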

edited Jul 03 '15 at 19:05

answered Jul 03 '15 at 18:52

alecxe


I did give it a try after changing the load call to `wb = load_workbook('book.xlsx')`. However, I got this: **Traceback (most recent call last): File "C:\xx.py", line 53, in print get_maximum_row(ws, "A") File "C:\xx.py", line 51, in get_maximum_row for cell in ws._cells if cell.startswith(column)) File "C:\xx.py", line 51, in for cell in ws._cells if cell.startswith(column)) NameError: global name 'coordinate_from_string' is not defined** – egodial Jul 03 '15 at 19:03
Maybe I copied something wrong here, but it returns the same value for any column. `from openpyxl import load_workbook wb = load_workbook('libro.xlsx') ws = wb.get_sheet_by_name('Sheet1') from openpyxl.utils import coordinate_from_string def get_maximum_row(ws, column): return max(coordinate_from_string(cell)[-1] for cell in ws._cells if cell.startswith(column)) print get_maximum_row(ws, "A") print get_maximum_row(ws, "B") print get_maximum_row(ws, "C") print get_maximum_row(ws, "F")` Thanks for the hard work with this code! – egodial Jul 03 '15 at 20:22

score 1 · Accepted Answer · edited May 23 '17 at 11:44

Here's how to do it using Pandas.

It's easy to get the last non-null row in Pandas using last_valid_index.

There might be a better way to write the resulting DataFrame to your xlsx file but, according to the docs, this very dumb way is actually how it's done in openpyxl.

Let's say you're starting with this simple worksheet:

[Image: original worksheet]

Let's say we want to put xxx into column C:

import openpyxl as xl
import pandas as pd

wb = xl.load_workbook('deleteme.xlsx')
ws = wb.get_sheet_by_name('Sheet1')
df = pd.read_excel('deleteme.xlsx')

def replace_first_null(df, col_name, value):
    """
    Replace the first null value in DataFrame df.`col_name`
    with `value`.
    """
    return_df = df.copy()
    idx = list(df.index)
    last_valid = df[col_name].last_valid_index()
    last_valid_row_number = idx.index(last_valid)
    # This next line has mixed number and string indexing
    # but it should be ok, since df is coming from an
    # Excel sheet and should have a consecutive index
    return_df.loc[last_valid_row_number + 1, col_name] = value
    return return_df

def write_df_to_worksheet(ws, df):
    """
    Write the values in df to the worksheet ws in place
    """
    for i, col in enumerate(df):
        for j, val in enumerate(df[col]):
            if not pd.isnull(val):
                # Python is zero indexed, so add one
                # (plus an extra one to take account
                #  of the header row!)
                ws.cell(row=j + 2, column=i + 1).value = val

# Here's the actual replacing happening
replaced = replace_first_null(df, 'C', 'xxx')
write_df_to_worksheet(ws, replaced)
wb.save('changed.xlsx')

which results in:

[Image: edited Excel file]

This code is awesome, @LondonRob! It behaves a bit oddly with the .xlsx, though: e.g. it hides columns (a lot of them). Why? Also, I'm trying to change two things, but I will have to study this code further to do it: i) I need the code to find the last value in C, write 'xxx' and then, for example, write 'yyy' and 'zzz' in the next two columns of the same row; ii) as far as I can see, it depends too much on the column headers. It would be ideal if it worked with the column location or an index (this seems quite hard to do). – egodial, Jul 03 '15 at 20:53
Glad to help! These sound like new questions to me. You'll get the best help by posting them separately to this question. Good luck! — LondonRob, Jul 03 '15 at 20:56

score 1 · Answer 3 · answered Jun 11 '17 at 17:26

1

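alecxe's solution didn't work for me, probably because of the openpyxl version (I'm on 2.4.1). Here's what worked after a small tweak:

def get_max_row_in_col(ws, column):
    # In this version the keys of ws._cells are (row, column) tuples,
    # so column is the column number (e.g. 4 for D), not a letter.
    return max([cell[0] for cell in ws._cells if cell[1] == column])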
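
To illustrate the tweak: the function above compares against the second element of each key, so `column` has to be a number, because in that openpyxl version the keys of the private `ws._cells` dict are `(row, column)` tuples. The sketch below mimics that with a plain dict so it runs without a workbook. `col_index` and `fake_cells` are hypothetical names, and this variant takes the dict directly and returns 0 for an empty column instead of raising. openpyxl ships its own `column_index_from_string` in `openpyxl.utils` for the letter-to-number step.

```python
def col_index(letters):
    """Convert a column letter such as 'D' or 'AA' to its 1-based index."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def get_max_row_in_col(cells, column):
    """Highest row among (row, column) keys in the given column, or 0."""
    rows = [row for (row, col) in cells if col == column]
    return max(rows) if rows else 0


# Plain dict standing in for ws._cells: keys are (row, column) tuples.
fake_cells = {(1, 1): "a", (1000, 1): "z", (1, 4): "x", (150, 4): "y"}

print(get_max_row_in_col(fake_cells, col_index("A")))
print(get_max_row_in_col(fake_cells, col_index("D")))
```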

answered Jun 11 '17 at 17:26

rbrtk


score 0 · Answer 4 · answered Jul 03 '15 at 19:02

If this is a limitation of openpyxl then you might try one of the following approaches:

- Convert the Excel file to CSV and use the Python csv module.
- Uncompress the Excel file using zipfile and then navigate to the "xl/worksheets" subfolder of the uncompressed file, where you will find an XML file for each of your worksheets. From there you could parse and update it with BeautifulSoup or lxml.
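
The CSV route could look roughly like this. It is a sketch that assumes the sheet has already been exported to CSV (inlined here as a string so the example is self-contained) and that column D is the fourth field; `last_filled_row` is a hypothetical helper name.

```python
import csv
import io


def last_filled_row(csv_file, col_idx):
    """Return the 1-based number of the last row with a non-empty field at col_idx."""
    last = 0
    for row_number, row in enumerate(csv.reader(csv_file), start=1):
        if col_idx < len(row) and row[col_idx].strip():
            last = row_number
    return last


# Column A (index 0) has four rows of data, column D (index 3) only two.
sample = io.StringIO("a1,b1,c1,d1\na2,b2,c2,d2\na3,,,\na4,,,\n")
print(last_filled_row(sample, 3) + 1)  # next free row in column D
```

Writing the result back would still require converting the CSV to .xlsx again, which this sketch does not cover.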

The xlsx Excel format is a zipped folder tree of XML files. You can find the specification here.
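
As a rough sketch of the unzip-and-parse route using only the standard library (`zipfile` and `xml.etree.ElementTree` rather than BeautifulSoup or lxml): it assumes the sheet lives at `xl/worksheets/sheet1.xml` and that every cell element carries an `r` reference such as `D150`, and it builds a tiny stand-in archive in memory instead of opening a real workbook. `highest_row_in_column` is a hypothetical name.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"
SHEET = "xl/worksheets/sheet1.xml"  # assumed location of the first sheet


def highest_row_in_column(xlsx_file, column, sheet_path=SHEET):
    """Return the highest row in `column` whose cell has content, or 0."""
    with zipfile.ZipFile(xlsx_file) as archive:
        root = ET.fromstring(archive.read(sheet_path))
    highest = 0
    for cell in root.iter(NS + "c"):
        match = re.match(r"([A-Z]+)(\d+)$", cell.get("r", ""))
        # len(cell) > 0 means the <c> element has a child such as <v>.
        if match and match.group(1) == column and len(cell) > 0:
            highest = max(highest, int(match.group(2)))
    return highest


# Build a minimal stand-in archive: A1..A3 filled, D1 filled, D2 empty.
sheet_xml = (
    '<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    "<sheetData>"
    '<row r="1"><c r="A1"><v>1</v></c><c r="D1"><v>9</v></c></row>'
    '<row r="2"><c r="A2"><v>2</v></c><c r="D2"/></row>'
    '<row r="3"><c r="A3"><v>3</v></c></row>'
    "</sheetData></worksheet>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(SHEET, sheet_xml)

print(highest_row_in_column(buffer, "A"))
print(highest_row_in_column(buffer, "D"))
```

This only reads the sheet; writing a new cell back into the XML (and keeping shared strings consistent) is considerably more work than using a library.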

Jimbo · Answer 5 · 2016-08-18T13:49:26.377

Figured I'd start giving back to the Stack Overflow community. alecxe's solution didn't work for me and I didn't want to use pandas, so I did this instead. It checks from the end of the spreadsheet and gives you the next available (empty) row in column D.

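def unassigned_row_in_column_D():
    # Start at the sheet's last used row and walk up until column D has a value.
    ws_max_row = int(ws.max_row)
    # ws['D5'] indexing is used instead of ws.cell('D5'); the row > 0
    # check stops the loop if column D is completely empty.
    while ws_max_row > 0 and ws['D' + str(ws_max_row)].value is None:
        ws_max_row -= 1
    # The next row down is the first free one.
    return 'D' + str(ws_max_row + 1)

# then add variable data = 'xxx' to that cell

ws[unassigned_row_in_column_D()].value = data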
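
A sketch of the same bottom-up scan generalized to any column. It only assumes a worksheet object with a `max_row` attribute and `ws['D5']`-style indexing that returns something with a `.value`, which openpyxl worksheets provide. `FakeSheet` is a stand-in so the example runs without a file, and `next_free_cell` is a hypothetical name.

```python
from types import SimpleNamespace


def next_free_cell(ws, column):
    """Return the coordinate just below the last filled cell in `column`."""
    row = ws.max_row
    # Walk up from the sheet's last row until the column has a value.
    while row > 0 and ws[column + str(row)].value is None:
        row -= 1
    return column + str(row + 1)


class FakeSheet:
    """Minimal stand-in for a worksheet: a dict of coordinate -> value."""

    def __init__(self, values, max_row):
        self.values = values
        self.max_row = max_row  # highest row used anywhere on the sheet

    def __getitem__(self, coord):
        return SimpleNamespace(value=self.values.get(coord))


ws = FakeSheet({"A1": 1, "A2": 2, "A3": 3, "D1": "x"}, max_row=3)
print(next_free_cell(ws, "A"))
print(next_free_cell(ws, "D"))
print(next_free_cell(ws, "E"))
```

Passing the worksheet and column as arguments avoids relying on a global `ws` and a hard-coded 'D'.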
