I have 2 large dataframes which I have previously outlined in this question Faster method to join and filter two overlapping DataFrame with extra shared variable.
As a continuation to that question, I have been trying to use a join instead of a merge to speed up the operation. The join does indeed work much faster than the merge, but now it seems bogged down in the .loc operation despite the dataframes being sorted seem nearly identical, at least in terms of column names and df sizes.
Here is an example of the two different approaches I have taken:
df = pd.merge(df2,df1,how='left',left_on=['CP_ID','Con'], \
right_on=['CP_ID','Con'])
df = df.loc[(df.Timestamp>df.dStart)&(df.Timestamp<df.dEnd)]
and
df2_test = df2.copy()
df1_test = df1.copy()
df11 = df1_test.set_index(['CP_ID','Con'])
df22 = df2_test.set_index(['CP_ID','Con'])
df_test2 = df22.join(df11, how='left')
df_test2.reset_index(inplace=True)
df_test2 = df_test2.loc[(df_test2.Timestamp>df_test2.dStart)&(df_test2.Timestamp<df_test2.dEnd)]
I see a 3x speed improvement in the merge/join operation, but a 10x slow down in the .loc operation if I use the join instead of merge, which essentially cancels out any gains overall. Can anyone explain this? I have manually inspected the dataframes after the join/merge, and I can't really tell any difference that would be causing this huge slow down in the .loc