How to save the sum of specific columns from diffrent files

Question

I have written a code (thanks to) that groupe the column that I need to remain as it is and sum of the targeted columns:

import pandas as pd
import glob as glob
import numpy as np

#Read excel and Create DF  
all_data = pd.DataFrame()
for f in glob.glob(r'C:\Users\Sarah\Desktop\IDPMosul\Data\2014\09\*.xlsx'):
    df = pd.read_excel(f,index_col=None, na_values=['NA'])
    df['filename'] = f
    data = all_data.append(df,ignore_index=True)

#Group and Sum
result = data.groupby(["Date"])["Families","Individuals"].agg([np.sum])

#Save file
file_name = r'C:\Users\Sarah\Desktop\U2014.csv'
result.to_csv(file_name, index=True)

the problem is here :

 #Save file
 file_name = r'C:\Users\Sarah\Desktop\U2014.csv'
 result.to_csv(file_name, index=True)

the code gives me the result that I want however it only takes into account the last file that it iterates through, I need to save all the sums from different files thank you

score 1 · Accepted Answer · answered Jan 30 '20 at 15:53

Simply you never change all_data in the loop since it is never re-assigned. Each loop iteration appends to the empty data frame initialized outside loop. So only the very last file is retained. A quick (non-recommended) fix would include:

all_data = pd.DataFrame()
for f in glob.glob(r'C:\Users\Sarah\Desktop\IDPMosul\Data\2014\09\*.xlsx'):
   ...
   all_data = all_data.append(df, ignore_index=True)    # CHANGE LAST LINE IN LOOP

# USE all_data (NOT data) aggregation
result = all_data.groupby(...)

However, reconsider growing a data frame inside a loop. As @unutbu warns us: Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying. Instead, the recommended version would be to build a list of data frames to concatenate once outside the loop which you can do so with a list comprehension, even assign for filename:

# BUILD LIST OF DFs  
df_list = [(pd.read_excel(f, index_col=None, na_values=['NA']) 
              .assign(filename = f)
           ) for f in glob.glob(r'C:\Users\Sarah\Desktop\IDPMosul\Data\2014\09\*.xlsx')]

# CONCATENATE ALL DFs
data = pd.concat(df_list, ignore_index=True)

# AGGREGATE DATA
result = data.groupby(["Date"])["Families", "Individuals"].agg([np.sum])

file_name = r'C:\Users\Sarah\Desktop\U2014.csv'
result.to_csv(file_name, index=True)

How to save the sum of specific columns from diffrent files

1 Answers1