I'm dealing with badly laid out excel sheets which I'm trying to parse and write into a database.
Every sheet can have multiple tables. Though the header for these possible tables are known, which tables are gonna be on any given sheet is not, neither is their exact location on the sheet (the tables don't align in a consistent way). I've added a pic of two possible sheet layout to illustrate this: This layout has two tables, while this one has all the tables of the first, but not in the same location, plus an extra table.
What I do know:
- All the possible table headers, so each individual table can be identified by its headers
- Tables are separated by blank cells. They do not touch each other.
My question Is there a clean way to deal with this using some Python module such as pandas?
My current approach:
I'm currently converting to .csv and parsing each row. I split each row around blank cells, and process the first part of the row (should belong to the leftmost table). The remainder of the row is queued and later processed in the same manner. I then read this first_part
and check whether or not it's a header row. If it is, I use it to identify which table I'm dealing with (this is stored in a global current_df
). Subsequent rows which are not header rows are fed into this table (here I'm using pandas.DataFrame
for my tables).
Code so far is below (mostly incomplete and untested, but it should convey the approach above):
class DFManager(object): # keeps track of current table and its headers
current_df = None
current_headers = []
def set_current_df(self, df, headers):
self.current_headers = headers
self.current_df = df
def split_row(row, separator):
while row and row[0] == separator:
row.pop(0)
while row and row[-1] == separator:
row.pop()
if separator in row:
split_index = row.index(separator)
return row[:split_index], row[split_index:]
else:
return row, []
def process_df_row(row, dfmgr):
df = df_with_header(row) # returns the dataframe with these headers
if df is None: # is not a header row, add it to current df
df = dfmgr.current_df
add_row_to_df(row, df)
else:
dfmgr.set_current_df(df, row)
# this is passed the Excel sheet
def populate_dataframes(xl_sheet):
dfmgr = DFManager()
row_queue = Queue()
for row in xl_sheet:
row_queue.put(row)
for row in iter(row_queue.get, None):
if not row:
continue
first_part, remainder = split_row(row)
row_queue.put(remainder)
process_df_row(first_part, dfmgr)