I need to create a new data frame from two existing data frames, where the new data frame contains each row of the first data frame that is not in the second. I found some code here using the merge function that lets me do it. Basically, if merging a row with the second data frame returns a result, the row is already in that data frame and I don't add it to my new one:

for (j in 1:nrow(my.df)) {
    if(nrow(merge(my.df[j,],sample.df))==0) {
        test.df <- rbind(test.df,my.df[j,])
    }
}

The problem is that the for loop is very slow. Is there a more efficient way to build a data frame given the constraints I have?

my.df

A B class
1 2 x
2 3 y
3 4 z

sample.df

A B class
1 2 x

test.df should look like

A B class
2 3 y
3 4 z
xjtc55
  • Look at `?dplyr::setdiff()` and `?dplyr::anti_join()` for help. Beyond that, please post a reproducible example with ideal output :) – Nate Nov 21 '16 at 21:28
  • The code is reproducible, you just need two data frames and you will get the desired output that I am looking for (the code works as is). I am just looking for a faster way. – xjtc55 Nov 21 '16 at 21:32
  • what does `my.df` look like? how about `sample.df`? – Nate Nov 21 '16 at 21:34
  • I have included an example of what a data frame looks like – xjtc55 Nov 21 '16 at 21:37

1 Answer

Using library(dplyr) we can use anti_join():

anti_join(my.df, sample.df)
# Joining, by = c("A", "B", "class")
#   A B class
# 1 3 4     z
# 2 2 3     y
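
For a self-contained check, here is a minimal sketch that rebuilds the example data from the question and names the join columns explicitly. The column types and `stringsAsFactors = FALSE` are assumptions, since the question only shows the printed data frames. Passing `by` also avoids the "Joining, by" message.

```r
library(dplyr)

# Example data as printed in the question (types are an assumption)
my.df <- data.frame(A = c(1, 2, 3), B = c(2, 3, 4),
                    class = c("x", "y", "z"), stringsAsFactors = FALSE)
sample.df <- data.frame(A = 1, B = 2, class = "x", stringsAsFactors = FALSE)

# Keep only the rows of my.df that have no match in sample.df
test.df <- anti_join(my.df, sample.df, by = c("A", "B", "class"))
print(test.df)
```

Because the whole comparison happens in one call, there is no row-by-row `merge()` or repeated `rbind()`, which is most likely where the loop spends its time.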

As mentioned by @Gregor, you can also convert your data.frames into data.tables with library(data.table), which may be faster still on large inputs.
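
One way the data.table route might look is sketched below, using data.table's "not join" syntax `X[!Y, on = ...]`. The example data is rebuilt from the question, and the names `my.dt`, `sample.dt` and `test.dt` are made up for illustration. No timings are claimed here, so benchmark on your own data.

```r
library(data.table)

# Example data as printed in the question (types are an assumption)
my.dt <- data.table(A = c(1, 2, 3), B = c(2, 3, 4), class = c("x", "y", "z"))
sample.dt <- data.table(A = 1, B = 2, class = "x")

# Anti-join: rows of my.dt with no match in sample.dt on all three columns
test.dt <- my.dt[!sample.dt, on = c("A", "B", "class")]
print(test.dt)
```

If you already have data.frames, `setDT(my.df)` converts one to a data.table by reference, without copying.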

Nate
  • This works. There was noticeable speedup. Enough for now. I will try converting to a table for even more speed. – xjtc55 Nov 21 '16 at 21:51
  • Yes, if you specify `anti_join(..., by = c("variable1", "variable2"))`. The message is there to show which columns it is choosing to match on, which by default are all the common names shared between the two data.frames. – Nate Nov 21 '16 at 22:14