I have multiple dataframes that are generated in different iterations of a loop, as shown below: d1 is created in iteration 1, d2 in iteration 2, and so on..
d1=pd.DataFrame({'PARTICIPANT_ID':['idA'],'AGE':[32],'GENDER':['male'],'colA':[20],'colB':[100]})
d2=pd.DataFrame({'PARTICIPANT_ID':['idA'],'AGE':[32],'GENDER':['male'],'colC':[1],'colD':[6]})
d3=pd.DataFrame({'PARTICIPANT_ID':['idA'],'AGE':[32],'GENDER':['male'],'colE':[60],'colF':[11]})
d4=pd.DataFrame({'PARTICIPANT_ID':['idB'],'AGE':[43],'GENDER':['female'],'colA':[30],'colB':[200]})
d5=pd.DataFrame({'PARTICIPANT_ID':['idB'],'AGE':[43],'GENDER':['female'],'colC':[2],'colD':[7]})
d6=pd.DataFrame({'PARTICIPANT_ID':['idB'],'AGE':[43],'GENDER':['female'],'colE':[70],'colF':[12]})
d7=pd.DataFrame({'PARTICIPANT_ID':['idC'],'AGE':[28],'GENDER':['female'],'colE':[56],'colF':[48]})
I want to keep merging these dataframes after each iteration to a bigger final dataframe, or store them as dictionaries or some other data type and merge them together at the end of loop.
This is what the output needs to look like (PARTICIPANT_ID alone can act as index to these dataframes):
PARTICIPANT_ID AGE GENDER colA colB colC colD colE colF
idA 32 male 20.0 100.0 1.0 6.0 60 11
idB 43 female 30.0 200.0 2.0 7.0 70 12
idC 28 female NaN NaN NaN NaN 56 48
I'm currently doing like:
df_final = df_final.set_index(['PARTICIPANT_ID','AGE','GENDER'],inplace=True).combine_first(d1.set_index(['PARTICIPANT_ID','AGE','GENDER'],inplace=True))
where df_final is the final output dataframe and I'm repeating this process in loop for every new dataframe that's generated in each iteration.
Problem with this type of merge is that it is PAINFULLY SLOW. Can someone please suggest a better way to achieve the same output in a FASTER AND EFFICIENT way.
Please note that the loop iterate over several hundred thousands records, and has many more columns than shown in example above.