0

I have 2 large data frames that are suppose to be duplicates of each other. However there is 3 extra rows in the second data frame. I need to find these 3 extra rows that are not present in the first data frame and remove them from the second data frame so that the data frames are the same. The 3 rows could be located anywhere in the data frame, not just added at the end.

I don't know the most efficient way to go about this. I have tried using the %in% operator along side ! to go through each column in the data to find the rows that differ but this is taking too long as there are over 100 columns. Has anyone got a more efficient way to do such a task?

Thanks

daisy
  • 61
  • 1
  • 6
  • Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. – Sotos Jun 16 '21 at 09:26

2 Answers2

0

I think the most efficient way would be just to use the first data.frame which does not have those extra rows.

But if you need to know where they are in the second you can use in case the rows of the data.frame are unique duplicated:

which(!tail(duplicated(rbind(x, y)), -nrow(x)))
#[1] 4 5

or using interaction and %in%:

which(!interaction(y) %in% interaction(x))
#[1] 4 5

or using paste and %in%:

which(!do.call(paste, y) %in% do.call(paste, x))
#[1] 4 5

Data:

x <- data.frame(a=1:3)
y <- data.frame(a=1:5)
GKi
  • 37,245
  • 2
  • 26
  • 48
0

you can use anti_join methods but in pandas with dataframe they do not exist... so you can do this with merge :

def anti_join(x, y, on):
"""Return rows in x which are not present in y (dataframe)"""
ans = pd.merge(left=x, right=y, how='left', indicator=True, on=on)
ans = ans.loc[ans._merge == 'left_only', :].drop(columns='_merge')
return ans

the first method if you want to check on only 1 column

def anti_join_all_cols(x, y):
    """Return rows in x which are not present in y"""
    assert set(x.columns.values) == set(y.columns.values)
    return anti_join(x, y, x.columns.tolist())

the second for all columns in df

the return will give you only row in df2 NOT IN df, be careful with the direction of the parameters, if we reverse df and df2 the result will not be the same...

you can make :

df_difference = anti_join_all_cols(df2,df) 

source : https://gist.github.com/sainathadapa/eb3303975196d15c73bac5b92d8a210f

Tomo
  • 71
  • 1
  • 7