2

I am currently working on a project involving huge Dataframes to be merged. The below code:

mergeddf = pd.merge(left=leftDataFrame,right=rightDataFrame,right_on = rightKey, left_on = leftKey, how='outer', suffixes = [leftName,rightName], indicator=True)

returns me a merged Dataframe with a column named "_merge" (due to the option indicator=True) which indicates if that row exists in "left_only", "right_only" or "both".

However, I found that merge takes a lot of time, specially when there are many columns as well as may rows (I am running this on chunks of 50K rows with 18 columns). An alternative I tried from Improve Pandas Merge performance is to set my "key" columns to join on as index and then use df.join(df2,how='outer') and it significantly ran faster!

But my problem is that join() does not return the "_merge" indicator column which I absolutely need. Is there any way to get that information on which row beloged to which dataframe (or both) when using join()?

KCK
  • 2,015
  • 2
  • 17
  • 35
  • 1
    join is a merge shortcut where `left_index=True, right_index=True`. And the speed gain is actually from the created indexes, nothing specific to join. So set your indexes as before but instead of join, merge on indexes. – ayhan Feb 25 '21 at 10:43
  • Tried with ```left_index=True, right_index=True``` along with the same indexes but it somehow takes more time. By shortcut do you mean join internally call merge? – KCK Feb 25 '21 at 10:48
  • hmm, an outer merge shouldn't take that long, can you show the structure of your data? – Umar.H Feb 25 '21 at 10:50
  • @Manakin I am just loading huge csv files into the DFs. there are 18 columns with Numbers and Words in them. Won't be able to show a sample as I am not allowed to do so :) – KCK Feb 25 '21 at 10:52
  • 1
    Yes, at least it was the case before. I don't know if they added additional optimizations. From the docs: "The related join() method, uses merge internally for the index-on-index (by default) and column(s)-on-index join. If you are joining on index only, you may wish to use DataFrame.join to save yourself some typing." https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html – ayhan Feb 25 '21 at 10:55
  • it's your duty at a minimum to create a [mcve] if we can't see your data then how can anyone advise? – Umar.H Feb 25 '21 at 10:56
  • @Manakin My question is not how to increase the speed of merge. My question was if something else can be "achieved" within the limits of what python provides. Not sure why you'll need to reproduce anything here as its a ques related to the current python/pandas language spec and the accepted answer does exactly that. If you agree, can you please revert your downvote? – KCK Feb 25 '21 at 11:09
  • 1
    @ayhan Thanks for the comments. I checked the pandas code and indeed found merge with indexes being used in it.Thanks. – KCK Feb 25 '21 at 11:09

1 Answers1

1

@ayhan made the correct comment before me, but here's an elaboration:

leftDataFrame = leftDataFrame.set_index(leftKey)
rightDataFrame = rightDataFrame.set_index(rightKey)
mergeddf = pd.merge(
    left=leftDataFrame,
    right=rightDataFrame,
    left_index=True,
    right_index=True,
    how='outer',
    suffixes = [leftName, rightName],
    indicator=True)

As per documentation (https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html):

The related join() method, uses merge internally for the index-on-index (by ?>default) and column(s)-on-index join. If you are joining on index only, you may >wish to use DataFrame.join to save yourself some typing.

Digging deeper into the code, you can see that the code is a bit more involved than just a wrapper around merge, but the core functionality of join is captured with these lines:

                joined = merge(
                    joined, frame, how=how, left_index=True, right_index=True
                )
SultanOrazbayev
  • 14,900
  • 3
  • 16
  • 46
  • I tried the same already, but it does take more time than using ```join()```. Ayhan mentioned about join being a shortcut for merge with indexes, doe that mean join internally calls merge? If so, the time diff may be a random one? – KCK Feb 25 '21 at 10:50
  • Yes, my guess is that the time difference is due to randomness. If it's a substantial time difference and you can replicate it, then there's scope for digging into it... – SultanOrazbayev Feb 25 '21 at 10:55