2

I am building a library of functions for dealing with ExcelFiles that have multiple sheets, where the strucvture of the sheets are not consistent. A key part of the process is building a dict called workbook_structure that defines for each sheet the start rows, the end rows and the date of each sheet. This is then used in conjunction with pd.ExcelFile.parse(), to define the structure of the DataFrame from each sheet.

Before concatenating the DFs from each sheet I will perform a number of tests.

Before doing this I need to parse each sheet from an ExcelFile object and store each data frame for testing. My question is, what is the best type of container for storing multiple dataframes? At the moment I am using a dictionary.

I create this container inside a class called workbook_iterator that looks like this:

class workbook_iterator:
    def __init__(self, file_path, line_codes, workbook_structure):
        """
        file_path           : string
        line_codes          : DataFrame
        workbook_structure  : dict

        Class for extracting data from ExcelFile object, performing tests on each
        sheet and concatenating the multiple data frames and saving to csv. 
        """     
        self.__wkbk_DF = pd.ExcelFile(file_path)
        self.__line_codes = line_codes
        self.__wkbk_struc = workbook_structure
        self.All_DFs = {sheet : DataFrame() for sheet in self.__wkbk_DF.sheet_names}
        for sheet in self.All_DFs.keys():
            len_DF = len(self.__wkbk_DF.parse(sheet))
            DF = self.__wkbk_DF.parse(sheet,
                                      header = self.__wkbk_struc['start_rows'][sheet],
                                      skip_footer = len_DF - self.__wkbk_struc['end_rows'][sheet],
                                      parse_cols = self.__wkbk_struc['cols']
                                     ) 
            self.All_DFs[sheet] = DF

    pass

This works fine, but the All_DFs attribute of the class object when accessed basically prints out every data frame (and there are hundreds of sheets). Is there a better way to store multiple DataFrames?

NB the line_codes argument is used in subsequent methods, and so understanding its structure is not necessary for the purposes of this question.

Woody Pride
  • 13,539
  • 9
  • 48
  • 62
  • I think a dict is a fine way to store multiple dataframes (apart from the printing issue). Unless you have eg a common index among the different dataframes, then you could look at multi-index dataframes. – joris Apr 29 '14 at 08:12
  • Ideally it would be a container much like the ExcelFile object i.e. an object that has attributes that are the component data frames etc. Perhaps I can build a new class for that. Its not a huge burning issue though so may do as you suggest and stick with a dict. – Woody Pride Apr 29 '14 at 13:58
  • 1
    You can always easily subclass dict yourself to have attribute access if you really want that, eg http://stackoverflow.com/questions/4984647/accessing-dict-keys-like-an-attribute-in-python – joris Apr 29 '14 at 14:36
  • great idea! Will take me a bit to figure it out, but I like that a lot – Woody Pride Apr 29 '14 at 17:19

0 Answers0