17

How can I outer merge two data frames in place in pandas?

For example, assume we have these two data frames:

import pandas as pd

s1 = pd.DataFrame({
    'time':[1234567000,1234567005,1234567009],
    'X1':[96.32,96.01,96.05]
},columns=['time','X1'])  # to keep columns order

s2 = pd.DataFrame({
    'time':[1234567001,1234567005],
    'X2':[23.88,23.96]
},columns=['time','X2'])  # to keep columns order

They could be merged with pandas.DataFrame.merge (s3 = pd.merge(s1,s2,how='outer')) or with pandas.merge (s3=s1.merge(s2,how='outer')), but it isn't in place. Instead, I'd like the merged data frame to replace s1 in memory.

Franck Dernoncourt
  • 77,520
  • 72
  • 342
  • 501
  • Are you even sure avoiding copies is possible? There's a `copy` kwarg that ["*If False, do not copy data unnecessarily*"](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html), but playing around with it and `np.share_memory` suggests that in your example the result is always copied. Which is not that surprising, considering that the resulting dataframe is quite different from the originals. But I'm not very experienced with pandas so my impressions might be way off. – Andras Deak -- Слава Україні Jul 05 '17 at 23:32
  • What do you mean "it isn't in place"? What are you trying to achieve? – Andrew L Jul 05 '17 at 23:48
  • @AndrewL I'd like the merged data frame to replace s1 in memory. – Franck Dernoncourt Jul 05 '17 at 23:52
  • 2
    well since there isn't any `inplace` parameter i think the most you can do here is `s1 = pd.merge(s1,s2,how='outer')`, other than that, there's not much left to do i think.. sorry – Rayhane Mama Jul 05 '17 at 23:58
  • @RayhaneMama Thanks. You're welcome to convert the comment into an answer. – Franck Dernoncourt Jul 06 '17 at 14:31
  • 1
    @AndrasDeak FYI [What are the exact downsides of copy=False in DataFrame.merge()?](https://stackoverflow.com/q/32330490/395857) – Franck Dernoncourt Jul 08 '17 at 19:40
  • 1
    Thanks for the link. I'm probably not qualified to answer, though:) My hunch is that copying can only be avoided in very special cases (considering that `.merge` is a very general workhorse that provides all sorts of operations), when the result verbatim contains columns of the original dataframe(s). – Andras Deak -- Слава Україні Jul 08 '17 at 19:51
  • 1
    Franck FYI [I tried to investigate anyway](https://stackoverflow.com/a/44992292/5067311). I convinced myself that the `copy` kwarg is mostly for show unless one's doing something trivial/weird. In your case I suspect there are bunch of actual concatenations going on, so no skipping copies for you :( – Andras Deak -- Слава Україні Jul 09 '17 at 02:00

2 Answers2

17

Since there is not inplace parameter in pandas.merge i think the most you can do is:

s1 = pd.merge(s1,s2,how='outer')

other than that, i don't think there's much left to do.
Hope that was helpful somehow.

Rayhane Mama
  • 2,374
  • 11
  • 20
  • 5
    It is really a pity that there is no in-place merge or join (even with restriction that indices must match). – sophros Feb 03 '20 at 12:30
  • 2
    this solution does not perform an "inplace" substitution. It replaces the initial s1 reference by a new one. So it does not behave the same especially if you pass the s1 dataframe to a function and perform the merge within the function... – jeromerg May 10 '22 at 19:13
  • 2
    @jeromerg exactly, that's because, as pointed out by Rayhane himself and the whole of the pandas library, there ISN'T a way to perform an inplace substitution through pandas. That's clear enough from the short answer text itself... – Julio Cezar Silva Aug 17 '22 at 15:03
2

The marked answer is correct, there is no built-in way to this. Here are a couple ways I found to fake it in certain scenarios. They're likely quite slow, but they do suffice to minimize memory footprint. Use at your own risk.

s1["X2"] = float("NaN")

for i, row in s2.iterrows():
    if row.time in s1.time.values:
        s1.loc[s1.time == row.time, "X2"] = row.X2
    else:
        s1.loc[len(s1), :] = row

or

for _, row in s2.loc[s2.time.isin(s1.time)].iterrows():
    s1.loc[s1.time == row.time, "X2"] = row.X2

for _, row in s2.loc[~s2.time.isin(s1.time)].iterrows():
    s1.loc[len(s1), :] = row
Alecg_O
  • 892
  • 1
  • 9
  • 19