
Possible Duplicate:
Existing function for seeing if a row exists in a data frame?

Suppose I have the following data frame in R.

df = data.frame('a'=c(1:3), 'b'=c(4:6))

This data frame contains three rows: (1,4), (2,5) and (3,6). Suppose I did not know which rows df contains and wanted to check whether the row (1,4) belongs to it. How can I check that?

My actual case involves comparison of 27 parameter values. Is there a solution in which I can do this without typing each and every parameter name? Thanks!

The reason I want to do this is that I have an R dataset called masterdata which contains simulation data. I want to update this data set with new data obtained as I make additional simulation runs with different parameter combinations. It is possible, however, that I forget I have already run the simulation for a certain parameter combination and run it again, in which case masterdata will be expanded with duplicate rows. I could remove these duplicates afterwards, but I would rather the update not go through at all if the values are duplicates. For this I need to check whether the data from a simulation run is already present in masterdata, which I can do if I know how to check whether a given row belongs to it.

Thanks.

Curious2learn
  • You might find some ideas in this earlier question: [Existing function for seeing if a row exists in a data frame?](http://stackoverflow.com/questions/5916854/existing-function-for-seeing-if-a-row-exists-in-a-data-frame) – Marek Jun 13 '11 at 09:21
  • Thanks for the link Marek. Did not know about that thread. – Curious2learn Jun 13 '11 at 11:19
  • There are two solutions there, one by you (which is similar to the one here) and one by Hadley. Is one faster than the other? Thanks. – Curious2learn Jun 13 '11 at 11:34
  • @Curious2learn I think it depends on the data: number of rows, number of columns and types of columns. – Marek Jun 13 '11 at 11:59
  • @Curious2learn I ran some tests and it seems that Hadley's is much faster (for a wide data.frame, ~3x faster). – Marek Jun 13 '11 at 12:27
  • I vote to reopen -- the true aim of the OP is to remove duplicated rows, so this is a different question than the previous one. – mbq Jun 13 '11 at 17:39

4 Answers

6

There may be more efficient ways, but I think

tail(duplicated(rbind(masterdata,newvals)),1)

will do it: in other words, attach the new row to the end of the data frame and see whether it is duplicated or not.
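As a minimal sketch of how this check could guard the update step described in the question: the helper name `row_in` and the sample values below are hypothetical, and the sketch assumes `newvals` is a one-row data frame with the same column names as `masterdata`.

```r
# Toy stand-in for the real simulation data (hypothetical values).
masterdata <- data.frame(a = 1:3, b = 4:6)

# TRUE if the single row `newvals` already occurs in `master`:
# append it as the last row and ask whether that last row is a duplicate.
row_in <- function(master, newvals) {
  tail(duplicated(rbind(master, newvals)), 1)
}

newvals <- data.frame(a = 1L, b = 4L)

# Only append when the parameter combination is new.
if (!row_in(masterdata, newvals)) {
  masterdata <- rbind(masterdata, newvals)
}
nrow(masterdata)  # still 3, because (1,4) was already present
```

Because the comparison runs over whole rows, no column names need to be typed, which should carry over to the 27-parameter case.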

Ben Bolker
  • [I agree with your answer](http://stackoverflow.com/questions/5916854/existing-function-for-seeing-if-a-row-exists-in-a-data-frame/5917042#5917042) ;) – Marek Jun 13 '11 at 09:31
2

If you want to compare only two numeric columns in the data.frame, then this does the trick:

> which(df$a+df$b*1i == 1+4i)
[1] 1

This may or may not be faster than other vectorized solutions.

kohske
1

There are quite a few ways to do this. You can use ifelse(), which is vectorized, to return a 1 or 0 for each row of your data frame depending on whether it meets your conditions.

> with(df, ifelse(a == 1 & b == 4, 1, 0))
[1] 1 0 0

Since you are probably only interested in knowing whether your parameter combination has been run at all, you can wrap sum() around the previous command:

> sum(with(df, ifelse(a == 1 & b == 4, 1, 0)))
[1] 1

Another alternative is to use nrow() and subset(). We'll again use the & operator for our testing:

> nrow(subset(df, a == 1 & b == 4))
[1] 1
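
One possible way to extend the `a == 1 & b == 4` test to many columns without typing each name is to compare column by column and combine the results. This is only a sketch, not part of the original answer: `newvals` is a hypothetical one-row data frame whose column names also appear in `df`, and `==` does exact comparison, so floating-point parameters may need rounding first.

```r
df <- data.frame(a = 1:3, b = 4:6)
newvals <- data.frame(a = 1, b = 4)  # hypothetical candidate row

# Compare every column of df with the matching value in newvals,
# then combine the per-column results with a logical AND.
hits <- Reduce(`&`, Map(`==`, df[names(newvals)], newvals))

any(hits)    # has this combination been run before?
which(hits)  # which row(s) match
```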
Chase
  • My actual case involves comparison of 27 parameter values. Is there a vectorized solution so that I do not have to type each and every parameter name? Thanks! – Curious2learn Jun 13 '11 at 02:17
  • @Curious2learn - see @Ben's answer for the path to enlightenment. He's steering you in the right direction there. – Chase Jun 13 '11 at 11:27
-1

You don't need any more than a single unique call:

Test<-data.frame(a=c(1,2,2,2,3),b=c(1,2,2,3,3),c=c(1,2,2,2,3))
Test
unique(Test) #Same with duplicated rows removed
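
Applied to the update workflow from the question, this could look like the sketch below. The data are made up and `newruns` is a hypothetical name. Note that this removes duplicates after appending rather than blocking the update beforehand.

```r
masterdata <- data.frame(a = 1:3, b = 4:6)
# Two new runs; the first repeats the existing combination (1,4).
newruns <- data.frame(a = c(1L, 7L), b = c(4L, 8L))

# Append everything, then keep only the first occurrence of each row.
masterdata <- unique(rbind(masterdata, newruns))
nrow(masterdata)  # 4: the repeated (1,4) row was dropped
```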
mbq