
I have multiple dataframes that are generated in different iterations of a loop, as shown below: d1 is created in iteration 1, d2 in iteration 2, and so on.

import pandas as pd

d1 = pd.DataFrame({'PARTICIPANT_ID': ['idA'], 'AGE': [32], 'GENDER': ['male'], 'colA': [20], 'colB': [100]})
d2 = pd.DataFrame({'PARTICIPANT_ID': ['idA'], 'AGE': [32], 'GENDER': ['male'], 'colC': [1], 'colD': [6]})
d3 = pd.DataFrame({'PARTICIPANT_ID': ['idA'], 'AGE': [32], 'GENDER': ['male'], 'colE': [60], 'colF': [11]})
d4 = pd.DataFrame({'PARTICIPANT_ID': ['idB'], 'AGE': [43], 'GENDER': ['female'], 'colA': [30], 'colB': [200]})
d5 = pd.DataFrame({'PARTICIPANT_ID': ['idB'], 'AGE': [43], 'GENDER': ['female'], 'colC': [2], 'colD': [7]})
d6 = pd.DataFrame({'PARTICIPANT_ID': ['idB'], 'AGE': [43], 'GENDER': ['female'], 'colE': [70], 'colF': [12]})
d7 = pd.DataFrame({'PARTICIPANT_ID': ['idC'], 'AGE': [28], 'GENDER': ['female'], 'colE': [56], 'colF': [48]})

I want to keep merging these dataframes into a bigger final dataframe after each iteration, or store them in dictionaries or some other data structure and merge them all together at the end of the loop.

This is what the output needs to look like (PARTICIPANT_ID alone can act as the index for these dataframes):

PARTICIPANT_ID  AGE GENDER  colA    colB    colC    colD    colE    colF
idA             32  male    20.0    100.0   1.0     6.0     60      11
idB             43  female  30.0    200.0   2.0     7.0     70      12
idC             28  female  NaN     NaN     NaN     NaN     56      48

This is what I'm currently doing:

df_final = df_final.set_index(['PARTICIPANT_ID', 'AGE', 'GENDER']).combine_first(d1.set_index(['PARTICIPANT_ID', 'AGE', 'GENDER']))

where df_final is the final output dataframe, and I repeat this step inside the loop for every new dataframe that is generated.

The problem with this type of merge is that it is painfully slow. Can someone please suggest a faster, more efficient way to achieve the same output?

Please note that the loop iterates over several hundred thousand records, and the data has many more columns than shown in the example above.

anon

1 Answer

You can get the same logic with concat + groupby + first; perhaps it's faster on your real data:

df_res = (pd.concat([d1, d2, d3, d4, d5, d6, d7], sort=False)
              .groupby(['PARTICIPANT_ID', 'AGE', 'GENDER']).first())
#                           colA   colB  colC  colD  colE  colF
#PARTICIPANT_ID AGE GENDER                                     
#idA            32  male    20.0  100.0   1.0   6.0  60.0  11.0
#idB            43  female  30.0  200.0   2.0   7.0  70.0  12.0
#idC            28  female   NaN    NaN   NaN   NaN  56.0  48.0

Otherwise, I would have suggested reduce, but you seem to be doing that already:

from functools import reduce

reduce(lambda l,r: l.combine_first(r), 
       [x.set_index(['PARTICIPANT_ID', 'AGE', 'GENDER']) for x in [d1, d2, d3, d4, d5, d6, d7]])

myl = [d1, d2, d3, d4, d5, d6, d7]

%timeit pd.concat(myl, sort=False).groupby(['PARTICIPANT_ID', 'AGE', 'GENDER']).first()
#9.11 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit reduce(lambda l,r: l.combine_first(r), [x.set_index(['PARTICIPANT_ID', 'AGE', 'GENDER']) for x in myl])
#61.3 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
ALollz
  • I don't get all the dataframes at one time, but one in each iteration of a big loop. How do I store these dataframes so I can concatenate them later like you suggested? And if I concat only 2 dataframes at a time, the whole process becomes slow as the final dataframe gets bigger as a result of the concatenation. – anon Mar 05 '19 at 20:49
  • Cool, let me try the list option you suggested. I'll let you know if it helps. Thanks! – anon Mar 05 '19 at 20:51
  • Yes, you should never append or concatenate `DataFrames` within a loop, as it leads to unnecessary copying; see https://stackoverflow.com/a/37009561/4333359. Just initialize an empty list `mylist = []`, then do `mylist.append(df)` within the loop (where `df` is the small sub-dataframe you get), and after your loop (outside it) make a single call to `pd.concat(mylist)` (see the short sketch after these comments). – ALollz Mar 05 '19 at 20:51
  • @anon I think the major slowdown here is that you wind up needing to combine several thousand small frames. I suspect this leads to unnecessary copying, which slows down as the base frame gets large (you can check this by timing each iteration to see if the time is growing). I suspect you will see a huge improvement from storing the small frames in a list and concatenating only once when you have them all. – ALollz Mar 05 '19 at 21:02
  • And you are 100% right @ALollz... I can see at least a 6x-8x improvement in speed. You are awesome. I'm getting greedy now. Do you think it's better to store the data in something other than a dataframe to begin with, to improve the speed a bit more? – anon Mar 05 '19 at 21:09
  • Apparently I have less than 15 reputation and can't upvote to change the score... wish I could though. – anon Mar 05 '19 at 21:10
  • I don't know about that other part, I mostly have experience with larger sets, not combining a large amount of small data. Perhaps someone else will answer. – ALollz Mar 05 '19 at 21:12
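
To make the pattern from the comments concrete, here is a minimal sketch of the append-to-a-list-then-concat-once approach. It reuses the d1-d7 frames from the question as stand-ins for the frames produced inside the real loop; the loop body itself is an assumption, not the asker's actual code:

import pandas as pd

frames = []                                # collect each small dataframe here
for d in (d1, d2, d3, d4, d5, d6, d7):     # stand-in for the real loop body
    frames.append(d)                       # list append is cheap; nothing is copied yet

# a single concat + groupby outside the loop replaces the repeated combine_first
df_final = (pd.concat(frames, sort=False)
              .groupby(['PARTICIPANT_ID', 'AGE', 'GENDER'])
              .first())

With several hundred thousand small frames, this single pd.concat call should scale much better than growing df_final inside the loop, since each combine_first on a growing frame re-copies all of the data accumulated so far.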