4

I have an excelsheet with multiple tables on the same sheet. These tables are of varying column numbers and varying row numbers. The good thing is however that the table header has a background color, and the table contents are of white background.

I am wondering if I could read in each of these tables as separate dataframes using xlrd or some other package.

The approach that am thinking of right now is pretty lengthy and may not be ideal.

For instance :

import xlrd
book = xlrd.open_workbook("some.xls", formatting_info=True)
sheets = book.sheet_names()
for index, sh in enumerate(sheets):
    sheet = book.sheet_by_index(index)
    rows, cols = sheet.nrows, sheet.ncols
    for row in range(rows):
         for col in range(cols):
             xfx = sheet.cell_xf_index(row, col)
             xf = book.xf_list[xfx]
             bgx = xf.background.pattern_colour_index
             if bgx != 64:
                 Header_row = rownum

Then traverse this Header_row and fetch all the column values and have them as dataframe column names. then continue parsing rows of the first column until you encounter a blank cell or a row which has only one or two non-null cells.

As you can see this is getting little lengthy and may not be the optimal way.

Appreciate your help in how can quickly pull out all the tales as separate dataframes. enter image description here

asimo
  • 2,340
  • 11
  • 29

1 Answers1

5

I think you have the right approach. Just put it in some functions to improve clarity.

Probably something like that:

import xlrd
# from typing import Dict

book = xlrd.open_workbook("some.xls", formatting_info=True)

def is_header(sheet, row, col, exclude_color=64):
    xf_index = sheet.cell_xf_index(row, col)
    bg_color = book.xf_list[xf_index].background.pattern_colour_index
    return bg_color != 64

def parse_sheet(sheet):
    """Parse a sheet and retrieve data as dataframe"""
    column_headers = dict()  # type: Dict[int, str]

    for row in range(sheet.nrows):
        # We skip rows if first cell is not a header and has no value
        # TODO: Remove if that skips line 13 as well
        if not is_header(sheet, row, 0) and not sheet.cell_value(row, 0):
            column_headers.clear()
            continue

        # Otherwise, we will populate the list of headers for column
        # And we will parse other data
        c_headers = [c for c in range(sheet.ncols) if is_header(sheet, row, c)]
        if c_headers:
            for col in c_headers:
                column_headers[col] = sheet.cell_value(row, col)

        else:
            for col in range(sheet.ncols):
                value = sheet.cell_value(row, col)
                # TODO: Add data in the dataframe and use column headers

for index in range(book.sheet_names()):
    parse_sheet(book.sheet_by_index(index))
Jean-Francois T.
  • 11,549
  • 7
  • 68
  • 107