
I have many Excel files in a directory with the same structure for each file -- for example the data below could be test1.xlsx:

Date      Type     Name      Task       Subtask       Hours
3/20/16   Type1    Name1     TaskXyz    SubtaskXYZ    1.00  
3/20/16   Type1    Name2     TaskXyz    SubtaskXYZ    2.00  
3/20/16   Type1    Name3     TaskXyz    SubtaskXYZ    1.00  

What I would like to do is create a new Excel file listing the file name and the sum of hours for each file in the directory, like this:

File Name     Sum of hours
test1.xlsx    4
test2.xlsx    10
...           ...

I just started playing around with glob, and that has been helpful for creating one large dataframe like this:

import glob
import pandas as pd

all_data = pd.DataFrame()
for f in glob.glob("path/*.xlsx"):
    df = pd.read_excel(f, skiprows=4, index_col=None, na_values=['NA'])
    all_data = all_data.append(df, ignore_index=True)

This has been helpful for creating a dataframe of all the data, agnostic of the sheet it came from, and I have been able to use groupbys to analyze the data on a macro level. However, as far as I know, I cannot sum by the sheet each row came from; I can only do things like:

task_output = all_data.groupby(["Task", "Subtask"])["Hours"].agg(["sum", "mean"])

This sums and averages over the whole dataframe rather than for each individual sheet.

Any ideas on where to start with this?

tmgolf

2 Answers


I would collect all your data frames into one list and then concatenate them in one shot - it should be much faster:

import os
import glob
import pandas as pd

def merge_excel_to_df_add_filename(flist, **kwargs):
    """Read each Excel file, tag its rows with the source file name,
    and concatenate everything into a single DataFrame."""
    dfs = []
    for f in flist:
        df = pd.read_excel(f, **kwargs)
        df['file'] = f          # remember which file each row came from
        dfs.append(df)
    return pd.concat(dfs, ignore_index=True)

fmask = os.path.join('/path/to/excel/files', '*.xlsx')
df = merge_excel_to_df_add_filename(glob.glob(fmask),
                                    skiprows=4,
                                    index_col=None,
                                    na_values=['NA'])
g = df.groupby('file')['Hours'].agg(['sum', 'mean']).reset_index()
# rename columns
g.columns = ['File_Name', 'sum of hours', 'average hours']
# write result to Excel file
g.to_excel('result.xlsx', index=False)
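
Note that df['file'] holds the full path returned by glob, not just the file name. If you want bare file names in the output, as in the table from the question, one small variation (a sketch reusing the df built above) is to strip the directory part with os.path.basename before grouping:

# keep only the base file name instead of the full path
df['file'] = df['file'].map(os.path.basename)
g = df.groupby('file')['Hours'].agg(['sum', 'mean']).reset_index()
g.columns = ['File_Name', 'sum of hours', 'average hours']
g.to_excel('result.xlsx', index=False)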
MaxU - stand with Ukraine

While reading each file into memory, you should remember the filename you are currently processing:

all_data = pd.DataFrame()
for f in glob.glob("path/*.xlsx"):
    df = pd.read_excel(f, skiprows=4, index_col=None, na_values=['NA'])
    df['filename'] = f    # tag each row with its source file
    all_data = all_data.append(df, ignore_index=True)

task_output = all_data.groupby(["filename", "Task", "Subtask"])["Hours"].agg(["sum", "mean"])
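
If all you need is the two-column summary from the question (file name and total hours), a minimal follow-up sketch, assuming the all_data frame built above, groups by filename alone and writes the result to a new file (the name summary.xlsx is just an example):

# per-file totals written out to a new Excel file
file_sums = all_data.groupby('filename')['Hours'].sum().reset_index()
file_sums.columns = ['File Name', 'Sum of hours']
file_sums.to_excel('summary.xlsx', index=False)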
biniow