111

I'd like to concatenate two dataframes, A and B, into a new one without duplicate rows (if a row in B already exists in A, don't add it):

Dataframe A:

   I    II   
0  1    2    
1  3    1    

Dataframe B:

   I    II
0  5    6
1  3    1

New Dataframe:

   I    II
0  1    2
1  3    1
2  5    6

How can I do this?
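
For reference, a minimal setup sketch of the two frames shown above (assuming plain integer columns named 'I' and 'II'):

import pandas as pd

A = pd.DataFrame({'I': [1, 3], 'II': [2, 1]})
B = pd.DataFrame({'I': [5, 3], 'II': [6, 1]})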

MJP

4 Answers

164

The simplest way is to just do the concatenation, and then drop duplicates.

>>> df1
   A  B
0  1  2
1  3  1
>>> df2
   A  B
0  5  6
1  3  1
>>> pandas.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
   A  B
0  1  2
1  3  1
2  5  6

The reset_index(drop=True) is to fix up the index after the concat() and drop_duplicates(). Without it you will have an index of [0,1,0] instead of [0,1,2]. This could cause problems for further operations on this dataframe down the road if it isn't reset right away.
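
To see the effect, here is a quick sketch (reconstructing df1 and df2 with the values shown above):

import pandas as pd

df1 = pd.DataFrame({'A': [1, 3], 'B': [2, 1]})
df2 = pd.DataFrame({'A': [5, 3], 'B': [6, 1]})

combined = pd.concat([df1, df2]).drop_duplicates()
print(list(combined.index))   # [0, 1, 0] -- duplicate index labels remain

combined = combined.reset_index(drop=True)
print(list(combined.index))   # [0, 1, 2]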

Ryan G
  • Can also use ignore_index=True in the concat to avoid dupe indexes. – Andy Hayden Jan 23 '14 at 20:06
  • @AndyHayden maybe worth noting - you can use `ignore_index=True` to avoid dupe indices, but if you don't use `reset_index`, then you may have skipped indices (since they were dropped), e.g. 0, 1, 2, 4, 5 ..., which may not be desirable – KRish Apr 29 '19 at 17:36
  • reset_index will lose the index info, which matters when the index is meaningful (e.g. not just a numeric range). – Niuya Apr 13 '22 at 13:10
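
Following up on the comments above, a small sketch (with made-up data) of the ignore_index=True variant; as KRish notes, drop_duplicates() can still leave gaps in the new index when the dropped row is not the last one:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 3], 'B': [2, 1]})
df2 = pd.DataFrame({'A': [3, 5], 'B': [1, 6]})  # the duplicate row comes first here

combined = pd.concat([df1, df2], ignore_index=True).drop_duplicates()
print(list(combined.index))   # [0, 1, 3] -- index 2 was the dropped duplicate

combined = combined.reset_index(drop=True)
print(list(combined.index))   # [0, 1, 2]
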
5

If DataFrame A already contains duplicate rows, concatenating and then dropping duplicates will also remove rows from DataFrame A that you might want to keep.

In this case, you will need to create a new column with a cumulative count and then drop duplicates. It all depends on your use case, but this situation is common with time-series data.

Here is an example:

import pandas as pd

df_1 = pd.DataFrame([
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 6, 'value': 34},
])

df_2 = pd.DataFrame([
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 6, 'value': 14},
])

# cumcount() numbers the repeated (date, id, value) rows within each frame,
# so the genuine duplicate inside df_1 is no longer identical to the copy in df_2
df_1['count'] = df_1.groupby(['date', 'id', 'value']).cumcount()
df_2['count'] = df_2.groupby(['date', 'id', 'value']).cumcount()

df_tot = pd.concat([df_1, df_2], ignore_index=False)
df_tot = df_tot.drop_duplicates()
df_tot = df_tot.drop(['count'], axis=1)
>>> df_tot
         date  id  value
0  11/20/2015   4     24
1  11/20/2015   4     24
2  11/20/2015   6     34
1  11/20/2015   6     14
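
For contrast, a rough sketch of what a plain drop_duplicates() would do with the same data (reusing df_1 and df_2 from above, minus the helper column; the name naive is just for illustration). The legitimate duplicate inside df_1 is lost:

naive = pd.concat(
    [df_1[['date', 'id', 'value']], df_2[['date', 'id', 'value']]],
    ignore_index=True
).drop_duplicates()

>>> naive
         date  id  value
0  11/20/2015   4     24
2  11/20/2015   6     34
4  11/20/2015   6     14
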
marwan
4

I'm surprised that pandas doesn't offer a native solution for this task. I don't think it's efficient to just concatenate and then drop the duplicates (as Ryan G suggested) if you work with large datasets.

It is probably most efficient to use sets to find the non-overlapping indices, and then use a list comprehension to translate them into 'row locations' (booleans), which is what you need to access rows with iloc. Below is a function that performs the task. If you don't choose a specific column (col) to check for duplicates, then the indexes will be used, as you requested. If you do choose a specific column, be aware that existing duplicate entries in 'a' will remain in the result.

import pandas as pd

def append_non_duplicates(a, b, col=None):
    if ((a is not None and type(a) is not pd.core.frame.DataFrame)
            or (b is not None and type(b) is not pd.core.frame.DataFrame)):
        raise ValueError('a and b must be of type pandas.core.frame.DataFrame.')
    if a is None:
        return b
    if b is None:
        return a
    if col is not None:
        # compare on the values of the chosen column
        aind = a.iloc[:, col].values
        bind = b.iloc[:, col].values
    else:
        # compare on the index
        aind = a.index.values
        bind = b.index.values
    # keys that appear in b but not in a ...
    take_rows = list(set(bind) - set(aind))
    # ... translated into a boolean row mask over b
    take_rows = [i in take_rows for i in bind]
    return pd.concat([a, b.iloc[take_rows, :]])

# Usage
a = pd.DataFrame([[1,2,3],[1,5,6],[1,12,13]], index=[1000,2000,5000])
b = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], index=[1000,2000,3000])

append_non_duplicates(a,b)
#        0   1   2
# 1000   1   2   3    <- from a
# 2000   1   5   6    <- from a
# 5000   1  12  13    <- from a
# 3000   7   8   9    <- from b

append_non_duplicates(a,b,0)
#       0   1   2
# 1000  1   2   3    <- from a
# 2000  1   5   6    <- from a
# 5000  1  12  13    <- from a
# 2000  4   5   6    <- from b
# 3000  7   8   9    <- from b
Daniel Hoop
  • And what if only rows where **all** row values are duplicated have to be dropped? Using `col = 0`, as in the example, would drop every row from `b` that starts with 1. – ns63sr May 12 '19 at 19:00
  • Usually `isinstance` is used instead of `type(...) is ...` – Winand May 01 '20 at 09:57
2

Another option:

concatenation = pd.concat([
    dfA,
    dfB[dfB['I'].isin(dfA['I']) == False],  # <-- keep only the rows of dfB that don't show up in dfA (based on values in column 'I')
], ignore_index=True)  # renumber the rows so the result gets a clean 0, 1, 2 index

The object `concatenation` will be:

     I    II
  0  1    2
  1  3    1
  2  5    6
Matt
  • This is the one that worked for me. Was worried drop_dupes would drop the wrong copy. The intention here is really clear to read. One suggestion - `== False` is flagged by flake8, which prefers `is False`, however `is False` raises a `KeyError` (I guess as we're working with a _series_ of bools). The syntax I settled on is `~`, which means `not`, ie, `dfB[~dfB['I'].isin(dfA['I'])]`. – Chris Apr 07 '23 at 09:11
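
For completeness, a minimal sketch of the `~` variant suggested in the comment above, assuming dfA and dfB hold the question's data:

import pandas as pd

dfA = pd.DataFrame({'I': [1, 3], 'II': [2, 1]})
dfB = pd.DataFrame({'I': [5, 3], 'II': [6, 1]})

# ~ negates the boolean mask, keeping only the dfB rows whose 'I' value is absent from dfA
concatenation = pd.concat([dfA, dfB[~dfB['I'].isin(dfA['I'])]], ignore_index=True)

>>> concatenation
   I  II
0  1   2
1  3   1
2  5   6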