
I have a data.frame A and a data.frame B that contains a subset of A.

How can I create a data.frame C which is data.frame A with the rows of data.frame B excluded? Thanks for your help.

adam.888
    Please revise your question following the guidelines outlined here: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Chase Apr 28 '12 at 20:10

6 Answers


Get the rows in A that aren't in B:

C = A[! data.frame(t(A)) %in% data.frame(t(B)), ]
Matthew Plourde
  • This may not be the fastest but it seems to be the safest as it accounts for the data in the rows and not the row names. Row names could be risky if the row names have been mixed up or rearranged. +1 – Tyler Rinker Apr 28 '12 at 20:59
  • @Tyler - it is safest if the OP provides an example of what the data actually looks like so it is not left up to the imagination of those trying to answer :) – Chase Apr 28 '12 at 21:05
  • It does not seem to work with mixed types: `A <- data.frame(x = 1:4, y = as.character(1:4)); B <- A[1:2, ]` – flodel Apr 28 '12 at 21:12
  • @Chase I fully agree with that. If we knew, then we could boost speed. Alas, we shall remain as mushrooms. – Tyler Rinker Apr 28 '12 at 21:12
  • @flodel I'd argue it does in that a character row is not a numeric row (cell). I'd say it did what it was supposed to. – Tyler Rinker Apr 28 '12 at 21:14
  • @flodel I strike my comment from the record; you are correct. I see what you're saying, which prompts me to post a response using the non-existent qdap package. – Tyler Rinker Apr 28 '12 at 21:18
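A minimal sketch of why the transpose trick works, using made-up example data (the OP never posted theirs): `t()` turns each row into a column, and `data.frame()` then makes each former row a list element, which `%in%` can compare as a whole.

```r
# hypothetical example data, not from the original question
A <- data.frame(x = 1:4, y = 11:14)
B <- A[1:2, ]

# each row of A/B becomes one list element, so %in% compares whole rows
C <- A[!data.frame(t(A)) %in% data.frame(t(B)), ]
C  # the rows of A not present in B
```

As flodel notes above, `t()` on a mixed-type data.frame coerces everything to character, so this comparison is only reliable when all columns share one type.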

If this B data set is truly a nested version of the first data set, there must be some indexing that created it in the first place. IMHO we shouldn't be computing the differences between the data sets but negating the original indexing that created B to begin with. Here's an example of what I mean:

A <- mtcars
B <- mtcars[mtcars$cyl==6, ]
C <- mtcars[mtcars$cyl!=6, ]
Tyler Rinker
A <- data.frame(x = 1:10, y = 1:10)
# random subset of A in B
B <- A[sample(nrow(A), 3), ]
# get the rows of A that are not in B
C <- A[-as.integer(rownames(B)), ]

Performance test vis-a-vis mplourde's answer:

library(rbenchmark)
f1 <- function() A[- as.integer(rownames(B)),]
f2 <- function() A[! data.frame(t(A)) %in% data.frame(t(B)), ]
benchmark(f1(), f2(), replications = 10000, 
          columns = c("test", "elapsed", "relative"),
          order = "elapsed"
          )

  test elapsed relative
1 f1()   1.531   1.0000
2 f2()   8.846   5.7779

Indexing by row names is approximately 6x faster here; the two calls to t() get expensive computationally.

Chase

If B is truly a subset of A, which you can check with:

if(!identical(A[rownames(B), , drop = FALSE], B)) stop("B is not a subset of A!")

then you can filter by rownames:

C <- A[!rownames(A) %in% rownames(B), , drop = FALSE]

or

C <- A[setdiff(rownames(A), rownames(B)), , drop = FALSE]
flodel

Here are two data.table solutions that are memory- and time-efficient:

library(data.table)
# some biggish data
set.seed(1234)
ADT <- data.table(x = seq.int(1e+07), y = seq.int(1e+07))

.rows <- sample(nrow(ADT), 30000)
# Random subset of A in B
BDT <- ADT[.rows, ]

# set keys for fast merge
setkey(ADT, x)
setkey(BDT, x)
## data.table approach: CDT <- ADT[-ADT[BDT, which = T]]
## copy the data as data.frames for the base-R alternative
A <- copy(ADT)
setattr(A, "class", "data.frame")
B <- copy(BDT)
setattr(B, "class", "data.frame")
f2 <- function() noBDT <- ADT[-ADT[BDT, which = T]]
f3 <- function() noBDT2 <- ADT[-BDT[, x]]
f1 <- function() noB <- A[-as.integer(rownames(B)), ]

library(rbenchmark)
benchmark(base = f1(),DT = f2(), DT2 = f3(), replications = 3)

##   test replications elapsed relative user.self sys.self
## 2   DT            3    0.92    1.108      0.77     0.15
## 1 base            3    3.72    4.482      3.19     0.52
## 3  DT2            3    0.83    1.000      0.72     0.11
mnel
  • Not sure that's fair to base: most of that time is converting rownames to integer, isn't it? The data.table joins should also be faster with mult="first" since you know the key is unique (known slow-down bug). This example data could confuse, with the key the same as the row numbers (there's no need to join at all). – Matt Dowle Oct 08 '12 at 07:12
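For reference, later versions of data.table also support an anti-join directly via `!` with `on=`, which avoids the intermediate `which = T` step; a minimal sketch with made-up data (not from the answer above):

```r
library(data.table)

# hypothetical data, keyed by x
ADT <- data.table(x = 1:10, y = letters[1:10])
BDT <- ADT[c(2, 5, 9)]

# anti-join: keep the rows of ADT whose x does not appear in BDT
CDT <- ADT[!BDT, on = "x"]
```

This expresses "A without B" in one step and lets data.table use its join machinery rather than a negative integer subset.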

This is not the fastest and is likely to be very slow, but it is an alternative to mplourde's answer that takes the row data into account and should work on the mixed-type data flodel pointed out. It relies on the paste2 function from the qdap package, which doesn't exist yet, as I plan to release it within the next month or two:

The paste2 function:

paste2 <- function(multi.columns, sep = ".", handle.na = TRUE, trim = TRUE) {

    # strip leading/trailing whitespace from each column
    if (trim) multi.columns <- lapply(multi.columns, function(x) {
        gsub("^\\s+|\\s+$", "", x)
    })

    # bind a non-data.frame list into a matrix
    if (!is.data.frame(multi.columns) & is.list(multi.columns)) {
        multi.columns <- do.call('cbind', multi.columns)
    }

    # paste each row together, optionally propagating NAs
    m <- if (handle.na) {
        apply(multi.columns, 1, function(x) {
            if (any(is.na(x))) {
                NA
            } else {
                paste(x, collapse = sep)
            }
        })
    } else {
        apply(multi.columns, 1, paste, collapse = sep)
    }
    names(m) <- NULL
    return(m)
}

# Flodel's mixed data set:

A <- data.frame(x = 1:4, y = as.character(1:4)); B <- A[1:2, ]

# My approach:

A[!paste2(A) %in% paste2(B), ]
Tyler Rinker