Merging two dataframes and removing duplicate rows WITH duplicate indexes (pandas)

Question

I have read miscellaneous posts with a similar question but couldn't find exactly this question. I have two pandas DataFrames that I want to merge. They have timestamps as indexes. The 2nd Dataframe basically overlaps the 1st and they thus both share rows with same timestamps and values.

I would like to remove these rows because they share everything: index and values in columns. If they don't share both index and values in columns, I want to keep them.

So far, I could point out:

Index.drop_duplicate: this is not what I am looking for. It doesn't check values in columns are the same. And I want to keep rows with same timestamps but different values in columns

DataFrame.drop_duplicate: well, same as above, it doesn't check index value, and if rows are found with same values in column but different indexes, I want to keep them.

To give an example, I am re-using the data given in below answer.

df1

                        Value
 2012-02-01 12:00:00     10
 2012-02-01 12:30:00     10
 2012-02-01 13:00:00     20
 2012-02-01 13:30:00     30

df2

                        Value
 2012-02-01 12:30:00     20
 2012-02-01 13:00:00     20
 2012-02-01 13:30:00     30
 2012-02-02 14:00:00     10

Result I would like to obtain is the following one:

                        Value
 2012-02-01 12:00:00     10  #(from df1)
 2012-02-01 12:30:00     10  #(from df1)
 2012-02-01 12:30:00     20  #(from df2 - same index than in df1, but different value)
 2012-02-01 13:00:00     20  #(in df1 & df2, only one kept)
 2012-02-01 13:30:00     30  #(in df1 & df2, only one kept)
 2012-02-02 14:00:00     10  #(from df2)

Please, any idea? Thanks for your help! Bests

Read [this link](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) on how to post a good question. You don't have example input tables or examples of what you want your table to look like afterwards. Help us help you. — Ukrainian-serge, Feb 27 '20 at 19:19
[ask], [mcve], [help/on-topic], https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — AMC, Feb 27 '20 at 19:55
Hello Serge, yes, sorry. You are right, I have added an example of data. — pierre_j, Feb 27 '20 at 20:47

Valdi_Bo · Accepted Answer · 2020-02-27T20:58:09.297

Assume that you have 2 following DataFrames:

df:

                 Date  Value
0 2012-02-01 12:00:00     10
1 2012-02-01 12:30:00     10
2 2012-02-01 13:00:00     20
3 2012-02-01 13:30:00     30
4 2012-02-02 14:00:00     10
5 2012-02-02 14:30:00     10
6 2012-02-02 15:00:00     20
7 2012-02-02 15:30:00     30

df2:

                 Date  Value
0 2012-02-01 12:00:00     10
1 2012-02-01 12:30:00     21
2 2012-02-01 12:40:00     22
3 2012-02-01 13:00:00     20
4 2012-02-01 13:30:00     30

To generate the result, run:

pd.concat([df, df2]).sort_values('Date')\
    .drop_duplicates().reset_index(drop=True)

The result, for the above data, is:

                 Date  Value
0 2012-02-01 12:00:00     10
1 2012-02-01 12:30:00     10
2 2012-02-01 12:30:00     21
3 2012-02-01 12:40:00     22
4 2012-02-01 13:00:00     20
5 2012-02-01 13:30:00     30
6 2012-02-02 14:00:00     10
7 2012-02-02 14:30:00     10
8 2012-02-02 15:00:00     20
9 2012-02-02 15:30:00     30

drop_duplicates drops duplicated rows, keeping the first. Since no subset parameter has been passed, the criterion to treat 2 rows as duplicates is identity of all columns.

Hello Valdi. Thanks for this answer. Following Serge comment, I have added an example based on the data you actually provided. Your answer is not matching what I intend. Please, do you think that considering the example I give, your approach can be adapted? — pierre_j, Feb 27 '20 at 20:48
Now I see that you want to keep one of duplicating rows. The difference in code boils down to removing *keep* parameter. — Valdi_Bo, Feb 27 '20 at 20:59

score 1 · Answer 2 · edited Nov 25 '21 at 14:25

1

Just improving the first answer, insert Date inside drop_duplicates

pd.concat([df, df2]).sort_values('Date')\
.drop_duplicates('Date').reset_index(drop=True)

edited Nov 25 '21 at 14:25

Suraj Rao

29,388
11
94
103

answered Nov 25 '21 at 14:23

Monica

11
2

Merging two dataframes and removing duplicate rows WITH duplicate indexes (pandas)

2 Answers2