-1

I have multiple excel files in my working directory. Few excel file contains multiple sheets, and few of the files contains single sheet. The issue that I have is that these excel files contains data starting at different rows instead of starting at row 1. Below snippet is example of excel file containing multiple sheets.As shown below Sheet 1 has data starting at row 3, whereas sheet 2 has data starting at row 5.Data are at random rows in other excel files. I would to do 3 below tasks.

  1. Create modified copy of the excel file such that the file contains data starting at row 1
  2. Split the sheet if the excel file contains multiple sheets.
  3. Finally, convert the modified sheet to CSV UTF-8

For task point 1 I referenced this link, using code provided by chip. I was able to get data for sheet 1 to move to row 2 in "modifiedsheet" (I can easily move it to row 1 by changing the number), but the issue that I am facing is this is manual approach and I would like to do it dynamically as I have multiple files with row all over the place.

Any help will be really appreciated! Thanks in advance for your time and efforts!

Sheet 1 Image

enter image description here

Sheet 2 Image

enter image description here

Python Code:

from openpyxl import Workbook
from openpyxl import load_workbook

wb = load_workbook("Myexcel_file1.xlsx")

ws1 = wb.active
ws2 = wb.create_sheet("modifiedSheet")

start_row = 3
start_col = 1

for row in ws1.iter_rows(min_row=start_row):
    for cell in row:
        # print(cell.value)
        ws2.cell(row = start_row-2, column = start_col, value=cell.value) # start_row - 2 will assign the value to the same column up 2 rows
        start_col += 1 # increment the column, for use of destination sheet
    start_row += 1 # increment the column, for use of destination sheet
    start_col = 1 # reset to first column after row processing

wb.save("modified.xlsx")
biggboss2019
  • 220
  • 3
  • 8
  • 30

1 Answers1

1

Here's my solution (with some debugging output).

The script opens an excel-file using openpyxl and processes all worksheets in that file.
It searches for leading empty rows, stops on the first non-empty row and then
removes all empty rows found so far using the delete_rows()-function.
See also openpyxl: Inserting and deleting rows and columns, moving ranges of cells

Iff any leading empty rows have been found (wb_modified == True), the modified version of the workbook will be saved as x-cleaned-XLFILE, where x represents the first letter of XLFILE.

delete_leading_empty_rows.py:

import openpyxl

XLFILE = 'test.xlsx'

def main(xls_file=XLFILE):
    '''Delete empty rows from row 1 to row n until 1st non-empty row in all worksheets'''

    def get_first_nonempty_row():
        '''Return number of first non-empty row'''
        print('get_first_nonempty_row, mr=', max_row_in_sheet)
        first_nonempty_row = None
        for row in range(1, max_row_in_sheet + 1):
            print('checking row', row)
            for col in range(1, max_col_in_sheet + 1):
                if sheet.cell(row, col).value is not None:
                    first_nonempty_row = row
                    print('first_nonempty_row =', first_nonempty_row)
                    return first_nonempty_row
        return first_nonempty_row

    def del_rows_before(first_nonempty_row):
        '''Delete empty rows from row 1 up to first non-empty row'''
        modified = False
        if first_nonempty_row > 1:
            print('del_rows_before', first_nonempty_row)
            print('deleting from 1 to', first_nonempty_row - 1)
            modified = True
            sheet.delete_rows(1, first_nonempty_row - 1)
        return modified

    print('opening xlsx-file')
    wb2 = openpyxl.load_workbook(xls_file)
    print('opened', xls_file, '\n')

    wb_modified = False
    for sheet in wb2.worksheets:
        print ('You are currently in', sheet)
        max_row_in_sheet = sheet.max_row
        max_col_in_sheet = sheet.max_column
        print ('max_row=', max_row_in_sheet, 'max_col=', max_col_in_sheet)

        sheet_modified = False
        if max_row_in_sheet > 1:
            first_nonempty_row = get_first_nonempty_row()
            sheet_modified = del_rows_before(first_nonempty_row)

        wb_modified = wb_modified or sheet_modified
        print('sheet modified:', sheet_modified)
        print('wb_modified:', wb_modified)
        print()

    print('finally: wb_modified:', wb_modified)
    if wb_modified:
        f_name = xls_file[0] + '-cleaned-' + xls_file
        print('writing to', f_name)
        wb2.save(f_name)
        wb_modified = False
    wb2.close()

if __name__ == '__main__':
    main()

split_sheets.py:

This script is a simple stand-alone. It reads all .xlsx-files in the current folder and stores each worksheet as a separate file - sheetname1-filename.xlsx, sheetname2-filename.xlsx, and so on.

'''Split excel workbooks in current directory by sheets'''

import glob
import openpyxl

def main():
    '''Split excel workbooks in current directory by sheets'''
    workbooks = glob.iglob('*.xlsx')

    for workbook in workbooks:
        print('reading:', workbook)
        wb2 = openpyxl.load_workbook(workbook)
        for sheet in wb2.worksheets:
            f_name = sheet.title + '-' + workbook
            print('split into', f_name)
            wb2.save(f_name)
        wb2.close()

if __name__ == '__main__':
    main()

delete_leading_empty_rows_glob.py:

This script calls delete_leading_empty_rows_glob() for each .xlsx-file in the current folder. Put it in the folder where delete_leading_empty_rows.py is located.

Execute split_sheets.py first, then execute delete_leading_empty_rows_glob.py.

'''Execute delete_leading_empty_rows.py for all .xlsx files in current directory.'''
import glob
import delete_leading_empty_rows

workbooks = glob.iglob('*.xlsx')
for workbook in workbooks:
    delete_leading_empty_rows.main(workbook)

AcK
  • 2,063
  • 2
  • 20
  • 27
  • Thank you! Will the above code be able to take care of all the excel files in current working directory ? I tried running for one file it worked, but how about all excel files in working directory ? Also, I need to have updated sheets in separate file and not the same file. Thanks in advance! – biggboss2019 Jun 04 '22 at 17:11
  • I tried running !delete_leading_empty_rows.py in my Jupyter NB, and received this error -> Traceback (most recent call last): File "\Files to be tested\delete_leading_empty_rows.py", line 65, in main() File "\Files to be tested\delete_leading_empty_rows.py", line 34, in main wb2 = openpyxl.load_workbook(XLFILE) NameError: name 'openpyxl' is not defined – biggboss2019 Jun 04 '22 at 17:14
  • 1
    @biggboss - sorry, forgot the line `import openpyxl` - see edited answer. BTW, splitting and recursing folders is in progress ;) – AcK Jun 04 '22 at 18:07
  • Thank you very much! I am working on it as well :) – biggboss2019 Jun 04 '22 at 18:08
  • 1
    @biggboss - done. Note the further change in the main file, so the script can receive a filename from the glob-script. – AcK Jun 04 '22 at 18:59
  • Thank you! Few issues I am facing...1) When I am running delete_leading_empty_rows.py: script. I am receving this error -> Traceback (most recent call last): File "C:\\Files to be tested\delete_leading_empty_rows.py", line 75, in get_ipython().system('delete_leading_empty_rows.py') NameError: name 'get_ipython' is not defined 2) When I run split_sheets.py: script It is splitting my original excel files instead of updated excel files. I would like to have the updated excel files splitted into new file – biggboss2019 Jun 04 '22 at 19:28