I am currently working on a project involving huge Dataframes to be merged. The below code:
mergeddf = pd.merge(left=leftDataFrame,right=rightDataFrame,right_on = rightKey, left_on = leftKey, how='outer', suffixes = [leftName,rightName], indicator=True)
returns me a merged Dataframe with a column named "_merge" (due to the option indicator=True) which indicates if that row exists in "left_only", "right_only" or "both".
However, I found that merge takes a lot of time, specially when there are many columns as well as may rows (I am running this on chunks of 50K rows with 18 columns). An alternative I tried from Improve Pandas Merge performance is to set my "key" columns to join on as index and then use df.join(df2,how='outer')
and it significantly ran faster!
But my problem is that join() does not return the "_merge" indicator column which I absolutely need. Is there any way to get that information on which row beloged to which dataframe (or both) when using join()
?