2

I have read miscellaneous posts with a similar question but couldn't find exactly this question. I have two pandas DataFrames that I want to merge. They have timestamps as indexes. The 2nd Dataframe basically overlaps the 1st and they thus both share rows with same timestamps and values.

I would like to remove these rows because they share everything: index and values in columns. If they don't share both index and values in columns, I want to keep them.

So far, I could point out:

Index.drop_duplicate: this is not what I am looking for. It doesn't check values in columns are the same. And I want to keep rows with same timestamps but different values in columns

DataFrame.drop_duplicate: well, same as above, it doesn't check index value, and if rows are found with same values in column but different indexes, I want to keep them.

To give an example, I am re-using the data given in below answer.

df1

                        Value
 2012-02-01 12:00:00     10
 2012-02-01 12:30:00     10
 2012-02-01 13:00:00     20
 2012-02-01 13:30:00     30

df2

                        Value
 2012-02-01 12:30:00     20
 2012-02-01 13:00:00     20
 2012-02-01 13:30:00     30
 2012-02-02 14:00:00     10

Result I would like to obtain is the following one:

                        Value
 2012-02-01 12:00:00     10  #(from df1)
 2012-02-01 12:30:00     10  #(from df1)
 2012-02-01 12:30:00     20  #(from df2 - same index than in df1, but different value)
 2012-02-01 13:00:00     20  #(in df1 & df2, only one kept)
 2012-02-01 13:30:00     30  #(in df1 & df2, only one kept)
 2012-02-02 14:00:00     10  #(from df2)

Please, any idea? Thanks for your help! Bests

pierre_j
  • 895
  • 2
  • 11
  • 26
  • 1
    Read [this link](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) on how to post a good question. You don't have example input tables or examples of what you want your table to look like afterwards. Help us help you. – Ukrainian-serge Feb 27 '20 at 19:19
  • [ask], [mcve], [help/on-topic], https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – AMC Feb 27 '20 at 19:55
  • 1
    Hello Serge, yes, sorry. You are right, I have added an example of data. – pierre_j Feb 27 '20 at 20:47

2 Answers2

3

Assume that you have 2 following DataFrames:

  1. df:

                     Date  Value
    0 2012-02-01 12:00:00     10
    1 2012-02-01 12:30:00     10
    2 2012-02-01 13:00:00     20
    3 2012-02-01 13:30:00     30
    4 2012-02-02 14:00:00     10
    5 2012-02-02 14:30:00     10
    6 2012-02-02 15:00:00     20
    7 2012-02-02 15:30:00     30
    
  2. df2:

                     Date  Value
    0 2012-02-01 12:00:00     10
    1 2012-02-01 12:30:00     21
    2 2012-02-01 12:40:00     22
    3 2012-02-01 13:00:00     20
    4 2012-02-01 13:30:00     30
    

To generate the result, run:

pd.concat([df, df2]).sort_values('Date')\
    .drop_duplicates().reset_index(drop=True)

The result, for the above data, is:

                 Date  Value
0 2012-02-01 12:00:00     10
1 2012-02-01 12:30:00     10
2 2012-02-01 12:30:00     21
3 2012-02-01 12:40:00     22
4 2012-02-01 13:00:00     20
5 2012-02-01 13:30:00     30
6 2012-02-02 14:00:00     10
7 2012-02-02 14:30:00     10
8 2012-02-02 15:00:00     20
9 2012-02-02 15:30:00     30

drop_duplicates drops duplicated rows, keeping the first. Since no subset parameter has been passed, the criterion to treat 2 rows as duplicates is identity of all columns.

Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41
  • 1
    Hello Valdi. Thanks for this answer. Following Serge comment, I have added an example based on the data you actually provided. Your answer is not matching what I intend. Please, do you think that considering the example I give, your approach can be adapted? – pierre_j Feb 27 '20 at 20:48
  • Now I see that you want to keep one of duplicating rows. The difference in code boils down to removing *keep* parameter. – Valdi_Bo Feb 27 '20 at 20:59
1

Just improving the first answer, insert Date inside drop_duplicates

pd.concat([df, df2]).sort_values('Date')\
.drop_duplicates('Date').reset_index(drop=True)
Suraj Rao
  • 29,388
  • 11
  • 94
  • 103
Monica
  • 11
  • 2