
I have many Excel files in a directory with the same structure for each file -- for example the data below could be test1.xlsx:

Date      Type     Name      Task       Subtask       Hours
3/20/16   Type1    Name1     TaskXyz    SubtaskXYZ    1.00  
3/20/16   Type1    Name2     TaskXyz    SubtaskXYZ    2.00  
3/20/16   Type1    Name3     TaskXyz    SubtaskXYZ    1.00  

What I would like to do is create a new Excel file listing the file name and the sum of hours for each file in the directory, like this:

File Name     Sum of hours
test1.xlsx    4
test2.xlsx    10
...           ...

I just started playing around with glob, and that has been helpful for creating one large dataframe like this:

import glob
import pandas as pd

all_data = pd.DataFrame()
for f in glob.glob("path/*.xlsx"):
    df = pd.read_excel(f, skiprows=4, index_col=None, na_values=['NA'])
    all_data = all_data.append(df, ignore_index=True)

This has been helpful for creating a dataframe of all the data, agnostic of the sheet it came from, and I have been able to use groupbys to analyze the data on a macro level. However, as far as I know, I cannot sum by the sheet each row came from; I can only do things like:

task_output = all_data.groupby(["Task", "Subtask"])["Hours"].agg(["sum", "mean"])

This sums and averages over the whole dataframe rather than for each individual sheet.

Any ideas on where to start with this?

tmgolf

2 Answers


I would collect all your data frames into one list and then concatenate them in one shot - it should be much faster:

import os
import glob
import pandas as pd

def merge_excel_to_df_add_filename(flist, **kwargs):
    """Read each Excel file, tag its rows with the source file name,
    and concatenate everything into a single DataFrame."""
    dfs = []
    for f in flist:
        df = pd.read_excel(f, **kwargs)
        df['file'] = f          # remember which file each row came from
        dfs.append(df)
    return pd.concat(dfs, ignore_index=True)

fmask = os.path.join('/path/to/excel/files', '*.xlsx')
df = merge_excel_to_df_add_filename(glob.glob(fmask),
                                    skiprows=4,
                                    index_col=None,
                                    na_values=['NA'])
g = df.groupby('file')['Hours'].agg(['sum', 'mean']).reset_index()
# rename columns
g.columns = ['File_Name', 'sum of hours', 'average hours']
# write result to Excel file
g.to_excel('result.xlsx', index=False)
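
Note that df['file'] holds the full path returned by glob, not just the file name. If you want bare file names in the output, as in the table from the question, one small variation (a sketch reusing the df built above) is to strip the directory part with os.path.basename before grouping:

# keep only the base file name instead of the full path
df['file'] = df['file'].map(os.path.basename)
g = df.groupby('file')['Hours'].agg(['sum', 'mean']).reset_index()
g.columns = ['File_Name', 'sum of hours', 'average hours']
g.to_excel('result.xlsx', index=False)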
MaxU - stand with Ukraine

While reading each file into memory, you should remember the filename you are currently processing:

all_data = pd.DataFrame()
for f in glob.glob("path/*.xlsx"):
    df = pd.read_excel(f, skiprows=4, index_col=None, na_values=['NA'])
    df['filename'] = f    # tag each row with its source file
    all_data = all_data.append(df, ignore_index=True)

task_output = all_data.groupby(["filename", "Task", "Subtask"])["Hours"].agg(["sum", "mean"])
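
If all you need is the two-column summary from the question (file name and total hours), a minimal follow-up sketch, assuming the all_data frame built above, groups by filename alone and writes the result to a new file (the name summary.xlsx is just an example):

# per-file totals written out to a new Excel file
file_sums = all_data.groupby('filename')['Hours'].sum().reset_index()
file_sums.columns = ['File Name', 'Sum of hours']
file_sums.to_excel('summary.xlsx', index=False)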
biniow