How to keep index when using pandas merge

Question

I would like to merge two DataFrames, and keep the index from the first frame as the index on the merged dataset. However, when I do the merge, the resulting DataFrame has integer index. How can I specify that I want to keep the index from the left data frame?

In [4]: a = pd.DataFrame({'col1': {'a': 1, 'b': 2, 'c': 3}, 
                          'to_merge_on': {'a': 1, 'b': 3, 'c': 4}})

In [5]: b = pd.DataFrame({'col2': {0: 1, 1: 2, 2: 3}, 
                          'to_merge_on': {0: 1, 1: 3, 2: 5}})

In [6]: a
Out[6]:
   col1  to_merge_on
a     1            1
b     2            3
c     3            4

In [7]: b
Out[7]:
   col2  to_merge_on
0     1            1
1     2            3
2     3            5

In [8]: a.merge(b, how='left')
Out[8]:
   col1  to_merge_on  col2
0     1            1   1.0
1     2            3   2.0
2     3            4   NaN

In [9]: _.index
Out[9]: Int64Index([0, 1, 2], dtype='int64')

EDIT: Switched to example code that can be easily reproduced

if you merge on a specific column, it is not clear which indices to use (in case they are both different). — bonobo, Aug 23 '18 at 20:56
It is pretty clear if you do a left or right merge for example. — Marses, Jul 16 '21 at 11:26

score 274 · Accepted Answer · edited Aug 27 '20 at 17:58

274

In [5]: a.reset_index().merge(b, how="left").set_index('index')
Out[5]:
       col1  to_merge_on  col2
index
a         1            1     1
b         2            3     2
c         3            4   NaN

Note that for some left merge operations, you may end up with more rows than in a when there are multiple matches between a and b. In this case, you may need to drop duplicates.

edited Aug 27 '20 at 17:58

Asclepius

57,944
17
167
143

answered Aug 16 '12 at 07:53

Wouter Overmeire

65,766
10
63
43

12

Very clever. a.merge(b, how="left").set_index(a.index) also works, but it seems less robust (since the first part of it loses the index values to a before it resets them.) – DanB Aug 16 '12 at 18:01
21

For this particular case, those are equivalent. But for many merge operations, the resulting frame has not the same number of rows than of the original `a` frame. reset_index moves the index to a regular column and set_index from this column after merge also takes care when rows of a are duplicated/removed due to the merge operation. – Wouter Overmeire Aug 16 '12 at 19:35
I didn't know why my solution would break down, but I figured it would. The reset_index() and then set_index() solution makes a lot of sense. Thanks. – DanB Aug 17 '12 at 04:33
3

@Wouter I'd love to know why a left merge will reindex by default. Where can I learn more? – Matthew Jun 08 '16 at 15:25
@Matthew see http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging – Wouter Overmeire Jun 09 '16 at 19:23
@DanB, I agree. The proposed soln will break when the index is not called `index`. But I do see your point, the `pandas.DataFrame.merge` could in principle rearrange rows (although my testing suggests that this does not happen). – Stephen McAteer Nov 10 '17 at 05:57
@DanB, on second thoughts, the default behavour of `pandas.DataFrame.merge` is `sort=False` so there shouldn't be an issue with your solution. – Stephen McAteer Nov 10 '17 at 06:07
12

Nice! To avoid explicitly specifying the index-name I use `a.reset_index().merge(b, how="left").set_index(a.index.names)`. – Truls Dec 08 '17 at 09:21
1

doesn't look very nice, though I guess that's how it should be done. I don't quite understand why they're dropping it in the first place, but there must be some reasons! – avloss Jun 17 '18 at 17:12
@WouterOvermeire Your elaboration in the comments about what `reset_index` does should be in the answer. I was going to try to incorporate the comment directly into the answer, but it didn't really work smoothly (especially the other details in the comment). So I think it would be better if you handle it. – jpmc26 Nov 08 '18 at 19:20
18

Pandas badly thought API strikes again. – Henry Henrinson Jul 25 '19 at 21:31
2

I still think pandas should keep the index for you (or at least give you the option to choose). There are many other "default" cases where indexes get duplicated; during a pd.concat for example. I don't think the current working is optimal. – Marses Aug 07 '19 at 12:30
If they have different lengths as a result from inner join, you cannot apply this trick. Any ideas? – matt.aurelio Jun 17 '22 at 09:23
@matt.aurelio it still works if they have different lengths. The "index" from the merged dataframe, minus rows that weren't included in result, becomes the new index. An alternative is to do a left join with `indicator=True` and then filter out "left_only" rows. – David Clarke Jun 22 '23 at 03:15

Matthew Son · Answer 2 · 2019-11-14T18:03:04.673

26

You can make a copy of index on left dataframe and do merge.

a['copy_index'] = a.index
a.merge(b, how='left')

I found this simple method very useful while working with large dataframe and using pd.merge_asof() (or dd.merge_asof()).

This approach would be superior when resetting index is expensive (large dataframe).

edited Nov 14 '19 at 18:03

answered Jul 27 '19 at 21:12

Matthew Son

1,109
8
27

2

This is the best answer. There are many reasons why you would want to preserve your old indexes during a merge (and the accepted answer doesn't preserve indexes, it just resets them). It helps when you're trying to merge more than 2 dataframes, and so on... – Marses Aug 07 '19 at 12:24
2

upvoted but just be wary of a caveat, when using multi-index, your indices will be stored as a tuple in a single column called a[copy_index] – geekidharsh Nov 06 '19 at 06:19
What I am reading in the docs about `merge_asof` indicates it is not using the index to join, it is using the closes index to join. You also have to have your data sorted a certain way so the closest index joins properly. – bfmcneill Nov 12 '20 at 14:58
3

This is just a less elegant version of the `reset_index()` solution. @MartienLubberink is incorrect, as `reset_index()` stores the index as a column by default. – Migwell Apr 27 '22 at 04:12

score 11 · Answer 3 · edited Oct 13 '22 at 23:26

11

There is a non-pd.merge solution using Series.map and DataFrame.set_index.

a['col2'] = a['to_merge_on'].map(b.set_index('to_merge_on')['col2']))

   col1  to_merge_on  col2
a     1            1   1.0
b     2            3   2.0
c     3            4   NaN

This doesn't introduce a dummy index name for the index.

Note however that there is no DataFrame.map method, and so this approach is not for multiple columns.

edited Oct 13 '22 at 23:26

Yuca

6,010
3
22
42

answered Sep 11 '17 at 17:33

Zero

74,117
18
147
154

2

This seems superior to the accepted answer as it will probably work better with edge cases like multi indexes. Can anyone comment on this? – BallpointBen Jan 17 '19 at 06:07
2

question, what if you need to assign multiple columns, would this approach work or is it limited to only 1 field? – Yuca Mar 11 '19 at 15:24
2

@Yuca: This possibly won't work with multiple columns, since when you subset multiple columns you end up with a `pd.Dataframe` and not a `pd.Series`. The `.map()` method is only defined for the `pd.Series`. This is to mean that: `a[['to_merge_on_1', 'to_merge_on_2']].map(...)` won't work. – Dataman Feb 13 '20 at 13:08
Brilliant. In my project we are using too many pandas tricks everywhere. This is very refreshing as it is straight forward and low level. Thank you! – F. Santiago Dec 06 '21 at 09:29

score 8 · Answer 4 · edited Aug 27 '20 at 19:34

8

df1 = df1.merge(df2, how="inner", left_index=True, right_index=True)

This allows to preserve the index of df1

edited Aug 27 '20 at 19:34

Asclepius

57,944
17
167
143

answered Apr 26 '19 at 06:43

Supratik Majumdar

2,365
1
23
31

It seems to work, but when I use it with `on=list_of_cols]`, it contradicts the documentation: `If joining columns on columns, the DataFrame indexes *will be ignored*`. Is one of using indices vs. columns has precedence? – Itamar Katz Jan 22 '20 at 09:56
6

@Supratik Majumdar doesn't your suggestion assume the indexes of the dataframes already match? The OP has non-matching indexes and is merging/joining on columns. – James Apr 01 '22 at 17:05

score 4 · Answer 5 · answered Jul 08 '22 at 02:48

You can also use DataFrame.join() method to achieve the same thing. The join method will persist the original index. The column to join can be specified with on parameter.

In [17]: a.join(b.set_index("to_merge_on"), on="to_merge_on")
Out[17]: 
   col1  to_merge_on  col2
a     1            1   1.0
b     2            3   2.0
c     3            4   NaN

lisrael1 · Answer 6 · 2022-04-04T20:03:33.983

2

another simple option is to rename the index to what was before:

a.merge(b, how="left").set_axis(a.index)

merge preserves the order at dataframe 'a', but just resets the index so it's safe to use set_axis

edited Apr 04 '22 at 20:03

answered May 04 '20 at 07:37

lisrael1

348
2
7

This didn't work for me – James Apr 01 '22 at 17:12
Nope. Doesn't work.. – Sahar Millis Dec 25 '22 at 18:56
this is the correct syntax: `a.merge(b, how="left").set_index(a.index)` – Caio Castro Feb 22 '23 at 18:28

score 2 · Answer 7 · answered Sep 24 '21 at 07:49

2

Assuming that the resulting df has the same number of rows and order as your first df, you can do this:

c = pd.merge(a, b, on='to_merge_on')
c.set_index(a.index,inplace=True)

answered Sep 24 '21 at 07:49

Alicia

21
2

score 0 · Answer 8 · answered Jan 24 '20 at 19:11

Think I've come up with a different solution. I was joining the left table on index value and the right table on a column value based off index of left table. What I did was a normal merge:

First10ReviewsJoined = pd.merge(First10Reviews, df, left_index=True, right_on='Line Number')

Then I retrieved the new index numbers from the merged table and put them in a new column named Sentiment Line Number:

First10ReviewsJoined['Sentiment Line Number']= First10ReviewsJoined.index.tolist()

Then I manually set the index back to the original, left table index based off pre-existing column called Line Number (the column value I joined on from left table index):

First10ReviewsJoined.set_index('Line Number', inplace=True)

Then removed the index name of Line Number so that it remains blank:

First10ReviewsJoined.index.name = None

Maybe a bit of a hack but seems to work well and relatively simple. Also, guess it reduces risk of duplicates/messing up your data. Hopefully that all makes sense.

score 0 · Answer 9 · answered Jun 23 '22 at 01:49

For the people that wants to maintain the left index as it was before the left join:

def left_join(
    a: pandas.DataFrame, b: pandas.DataFrame, on: list[str], b_columns: list[str] = None
) -> pandas.DataFrame:
    if b_columns:
        b_columns = set(on + b_columns)
        b = b[b_columns]
    df = (
        a.reset_index()
        .merge(
            b,
            how="left",
            on=on,
        )
        .set_index(keys=[x or "index" for x in a.index.names])
    )
    df.index.names = a.index.names
    return df

How to keep index when using pandas merge

9 Answers9

Linked

Related