I have many Excel files in a directory with the same structure for each file -- for example the data below could be test1.xlsx:
Date     Type   Name   Task     Subtask     Hours
3/20/16  Type1  Name1  TaskXyz  SubtaskXYZ  1.00
3/20/16  Type1  Name2  TaskXyz  SubtaskXYZ  2.00
3/20/16  Type1  Name3  TaskXyz  SubtaskXYZ  1.00
What I would like to do is create a new Excel file with the file name and sum of each file in the directory that would look like this:
File Name   Sum of Hours
test1.xlsx  4
test2.xlsx  10
...         ...
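To make the goal concrete, here is an untested sketch of what I mean (the names `hours_total` and `summarize_directory` are just placeholders I made up; `skiprows=4` matches the read I use below):

```python
import glob
import os

import pandas as pd

def hours_total(df: pd.DataFrame) -> float:
    """Return the total of the Hours column for one file's data."""
    return df["Hours"].sum()

def summarize_directory(pattern: str) -> pd.DataFrame:
    """Read every workbook matching `pattern` and build the summary table.

    Assumes the layout shown above: data starts after four header rows
    (hence skiprows=4) and includes an Hours column.
    """
    rows = []
    for path in sorted(glob.glob(pattern)):
        df = pd.read_excel(path, skiprows=4, index_col=None, na_values=["NA"])
        rows.append({"File Name": os.path.basename(path),
                     "Sum of Hours": hours_total(df)})
    return pd.DataFrame(rows, columns=["File Name", "Sum of Hours"])

# The summary could then be written out with, e.g.:
# summarize_directory("path/*.xlsx").to_excel("summary.xlsx", index=False)
```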
I just started playing around with glob, and that has been helpful for creating one large dataframe like this:
import glob
import pandas as pd

frames = []
for f in glob.glob("path/*.xlsx"):
    df = pd.read_excel(f, skiprows=4, index_col=None, na_values=["NA"])
    frames.append(df)
all_data = pd.concat(frames, ignore_index=True)
This has been helpful for creating a single dataframe of all the data regardless of which file it came from, and I have been able to use groupby to analyze the data at a macro level. But, as far as I know, I cannot sum per source file; I can only do things like:
task_output = all_data.groupby(["Task", "Subtask"])["Hours"].agg(["sum", "mean"])
which sums and averages over the whole dataframe rather than over each individual file.
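One idea I have considered (an untested sketch; the helper name `tag_and_combine` is my own invention) is to tag each row with its source workbook before concatenating, so a per-file groupby becomes possible:

```python
import pandas as pd

def tag_and_combine(named_frames):
    """Stack per-file DataFrames, tagging each row with its source name."""
    tagged = []
    for name, df in named_frames:
        df = df.copy()
        df["Source File"] = name  # remember which workbook the row came from
        tagged.append(df)
    return pd.concat(tagged, ignore_index=True)

# Hypothetical usage against the directory above (assuming glob and os
# are imported):
# named = [(os.path.basename(f),
#           pd.read_excel(f, skiprows=4, index_col=None, na_values=["NA"]))
#          for f in glob.glob("path/*.xlsx")]
# all_data = tag_and_combine(named)
# per_file = all_data.groupby("Source File")["Hours"].sum().reset_index()
```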
Any ideas on where to start with this?