Create dataframes in for loop from multiple Excel workbooks based on worksheet name?

Question

I have a folder of a few hundred Excel files all organized identically with nine sheets in each workbook. I am running the following code to iterate over the files and create one dataframe for each worksheet across all workbooks (so dataframe "sheet_a_df" will be sheet "a" from every workbook concatenated into a single dataframe).

sheet_a_df = pd.DataFrame()
for file in glob.glob('C:\\Users\*.xlsx'):
    df = pd.read_excel(file,sheetname='a')
    sheet_1_df = sheet_1_df.append(df,ignore_index=True).dropna()

sheet_b_df = pd.DataFrame()
for file in glob.glob('C:\\Users\\*.xlsx'):
    df = pd.read_excel(file,sheetname='b')
    sheet_b_df = sheet_b_df.append(df,ignore_index=True).dropna()

# And so on for all nine sheet names...

However, this requires copy and pasting the code nine times (once for each sheet).

Is there a more appropriate way to do this?

Reviewing this question, I understand dictionaries are the way to go for creating multiple dataframes in a for loop. I am also trying to name each df according to the worksheet's name. I created a list of my sheet names and tried the following code, but am getting a KeyError that simply returns the first sheet's name.

sheet_names = ['a',
               'b',
               'c',
               ...,]

df_dict = {}

for file in glob.glob('C:\\Users\*.xlsx'):
    for sheet in sheet_names:
        df = pd.read_excel(file,sheetname=sheet)
        df_dict[sheet] = df_dict[sheet].append(df)

Is there a way to fix the above code to create all nine dfs while naming them according to the sheets they come from?

score 1 · Accepted Answer · answered Aug 25 '17 at 23:42

You can take advantage of the fact that if you pass a list of sheet names to the sheetname parameter of the pd.read_excel function, it will return a dictionary of dataframes where the keys are the sheet names and the values are the dataframes corresponding to those sheet names. As a result, the following should get you a dictionary of concatenated dataframes: all "a" dataframes together, all "b" dataframes together, so on.

sheet_names = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
data = {}
for fn in glob.glob('C:\\Users\*.xlsx'):
    dfs = pd.read_excel(fn, sheetname=sheet_names)
    for k in dfs:
        data.setdefault(k, pd.DataFrame())
        data[k] = pd.concat([data[k], dfs[k]])

Now data should be a dictionary of dataframes with keys containing elements from sheet_names. Its values are the concatenated dataframes of corresponding sheet names from your files.

I hope this helps.

Thanks; that's what I was looking for. – dmitriys Aug 26 '17 at 01:49 — dmitriys, Aug 26 '17 at 01:49

Michael · Answer 2 · 2017-08-26T01:36:32.867

0

You are trying to append a dataframe to a non-existent dictionary item. You should first check if the key exists:

for file in glob.glob('C:\\Users\*.xlsx'):
    for sheet in sheet_names:
        df = pd.read_excel(file,sheetname=sheet)
        if sheet in df_dict:
            df_dict[sheet] = df_dict[sheet].append(df)
        else:
            df_dict[sheet] = df

edited Aug 26 '17 at 01:36

answered Aug 25 '17 at 23:22

Michael

5,095
2
13
35

By 'df_list' do you mean a list of dict keys? I created `df_list` with `list(dict.fromkeys(sheet_names))` but get a _'NoneType' object has no attribute 'append'_ error. – dmitriys Aug 26 '17 at 01:25

Create dataframes in for loop from multiple Excel workbooks based on worksheet name?

2 Answers2