I am building a library of functions for dealing with Excel files that have multiple sheets, where the structure of the sheets is not consistent. A key part of the process is building a dict called workbook_structure that defines, for each sheet, the start row, the end row and the date. This is then used in conjunction with pd.ExcelFile.parse() to define the structure of the DataFrame from each sheet.
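For concreteness, here is a minimal sketch of the kind of dict I mean (the sheet names, row numbers and column range are made up for illustration):

```python
# Illustrative workbook_structure dict; values are per-sheet except 'cols'.
workbook_structure = {
    'start_rows': {'Sheet1': 3, 'Sheet2': 5},     # header row index per sheet
    'end_rows':   {'Sheet1': 120, 'Sheet2': 98},  # last data row per sheet
    'cols':       'A:F',                          # column range for every sheet
}

# Each sheet is then parsed roughly like:
#   xl = pd.ExcelFile(file_path)
#   df = xl.parse(sheet,
#                 header=workbook_structure['start_rows'][sheet],
#                 skip_footer=len_df - workbook_structure['end_rows'][sheet],
#                 parse_cols=workbook_structure['cols'])
```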
Before concatenating the DFs from each sheet I will perform a number of tests. To do this I need to parse each sheet from an ExcelFile object and store each data frame for testing. My question is: what is the best type of container for storing multiple DataFrames? At the moment I am using a dictionary.
I create this container inside a class called workbook_iterator that looks like this:
import pandas as pd
from pandas import DataFrame


class workbook_iterator:
    def __init__(self, file_path, line_codes, workbook_structure):
        """
        file_path : string
        line_codes : DataFrame
        workbook_structure : dict

        Class for extracting data from an ExcelFile object, performing tests
        on each sheet, concatenating the multiple data frames and saving
        to csv.
        """
        self.__wkbk_DF = pd.ExcelFile(file_path)
        self.__line_codes = line_codes
        self.__wkbk_struc = workbook_structure
        self.All_DFs = {sheet: DataFrame() for sheet in self.__wkbk_DF.sheet_names}
        for sheet in self.All_DFs.keys():
            len_DF = len(self.__wkbk_DF.parse(sheet))
            DF = self.__wkbk_DF.parse(
                sheet,
                header=self.__wkbk_struc['start_rows'][sheet],
                skip_footer=len_DF - self.__wkbk_struc['end_rows'][sheet],
                parse_cols=self.__wkbk_struc['cols'],
            )
            self.All_DFs[sheet] = DF
This works fine, but accessing the All_DFs attribute of a class instance prints out every data frame in full (and there are hundreds of sheets). Is there a better way to store multiple DataFrames?
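To illustrate the kind of thing I have in mind, one idea I have considered is keeping the dict but overriding its repr so that only a summary is shown. This is purely a sketch; the DFStore name and its repr format are invented here, not part of my actual code:

```python
import pandas as pd


class DFStore(dict):
    """A plain dict of DataFrames whose repr shows shapes, not contents.

    Hypothetical example: behaves exactly like a dict, but printing it
    lists each key with its DataFrame's dimensions instead of dumping
    every frame in full.
    """

    def __repr__(self):
        lines = ['{} : {} rows x {} cols'.format(k, *v.shape)
                 for k, v in self.items()]
        return 'DFStore({} DataFrames)\n'.format(len(self)) + '\n'.join(lines)


store = DFStore()
store['Sheet1'] = pd.DataFrame({'a': [1, 2, 3]})
print(repr(store))
```

This keeps dictionary-style access by sheet name while making the container safe to inspect interactively, but I don't know whether subclassing dict like this is considered good practice compared to some other container.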
NB the line_codes argument is used in subsequent methods, so understanding its structure is not necessary for the purposes of this question.