Subset a data frame based on another

Question

I'm trying to implement an answer to the question posed here, but I can't see where I'm going wrong.

I am trying to extract from one dataframe observations listed in another dataframe using a common field. The question cited was not exactly the same but an answer suggested using "setdiff" for the related question, which seems to fit my need.

Here's the example I set up to try it:

    # original dataframe
    origdf <- data.frame(apple = c(111, 2, 4, "fox"), 
             orange=c( 222, 11, 12, 14), 
             pear=c( "one", "two", 10, 11),
             peach=c("which", "way", "to", "go"),
             banana=c(333, 22, 23, 24),
             grape=c(77, 78, 79, 80))
    origdf
    # a separate process produces a dataframe with observations to be extracted from the original dataframe
    extract <- origdf[which(origdf$apple == 111 |
                origdf$apple == "fox"), ]
    extract
    test <- origdf[setdiff(origdf$apple, extract$apple)]
    test

    # the above returns an error that "undefined columns selected", but the following works...
    origdf$apple
    extract$apple

Why am I having this problem?

You missed a comma: `test <- origdf[setdiff(origdf$apple, extract$apple),]`. Without it, R thinks that you are subsetting columns, which you obviously are not. — acylam, Nov 10 '17 at 17:04
Silly mistake, no? But I was expecting to see rows 2 and 3 remaining after the extract, but instead I see rows 2 and 4. — W Barker, Nov 10 '17 at 17:28

score 1 · Accepted Answer · answered Nov 10 '17 at 17:46

As mentioned in the comment, you received an error because you missed a comma in:

    test <- origdf[setdiff(origdf$apple, extract$apple)]

Without it, R treats the index as a column selection, and because the values returned by setdiff are not column names of origdf, you get "undefined columns selected".
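The difference between the two forms can be seen on a small, made-up data frame (the names `df`, `id`, and `value` are illustrative only and are not from the question). This is a minimal sketch of how a single index without a comma is interpreted:

```r
# Hypothetical three-row data frame, used only for illustration
df <- data.frame(id = c("a", "b", "c"), value = c(10, 20, 30))

# One index and no comma: the index selects COLUMNS
cols <- df[2]      # a data frame containing only the 'value' column

# Index followed by a comma: the index selects ROWS
rows <- df[2, ]    # the second row, with all columns

# A character index without a comma is looked up among the column names,
# so a value that is not a column name raises the error from the question
msg <- tryCatch(df["b"], error = function(e) conditionMessage(e))
print(msg)
```

Here `"b"` is a value in the `id` column, not a column name, which mirrors what happens when the values from setdiff are used without the comma.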

Your second issue is with the use of setdiff for indexing. When subsetting the rows of a data.frame, you need to provide numeric indices, row names, or a logical vector indicating whether each row should be included in the final subset. The following, however,

    setdiff(origdf$apple, extract$apple)

returns the values of apple that are not in extract, not their positions:

    [1] "2" "4"

Because this is a character vector, R matches it against the row names of origdf when you call:

    test <- origdf[setdiff(origdf$apple, extract$apple),]

The default row names are "1" to "4", so the values "2" and "4" happen to select the second and fourth rows:

      apple orange pear peach banana grape
    2     2     11  two   way     22    78
    4   fox     14   11    go     24    80
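One way to see what a character row index does is to repeat the subset after giving the rows non-default names. The sketch below uses a cut-down copy of the question's data (only two columns, with `stringsAsFactors = FALSE` set explicitly as an assumption so the result does not depend on the R version); the row names `w` to `z` are made up for the illustration.

```r
# Cut-down version of the question's data frame
df <- data.frame(apple = c("111", "2", "4", "fox"),
                 grape = c(77, 78, 79, 80),
                 stringsAsFactors = FALSE)

# With the default row names, "2" and "4" find the second and fourth rows
first <- df[c("2", "4"), ]

# After renaming the rows, the same index no longer matches any row name,
# so the result contains only NA values
rownames(df) <- c("w", "x", "y", "z")
res <- df[c("2", "4"), ]

# Indexing by the new names selects the same two rows as before
by_name <- df[c("x", "z"), ]
```

This suggests that the character values are looked up as row names rather than as values of the apple column, which is why the original result only looked plausible by coincidence.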

To return what you actually want, you can use %in%, which returns a logical vector indicating whether each element of origdf$apple is in the setdiff:

    test <- origdf[origdf$apple %in% setdiff(origdf$apple, extract$apple),]

This returns:

      apple orange pear peach banana grape
    2     2     11  two   way     22    78
    3     4     12   10    to     23    79

An alternative is to check whether origdf$apple is %in% extract$apple and keep the rows that are not (!):

    test <- origdf[!origdf$apple %in% extract$apple,]
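As a quick check, the two approaches can be compared on a reduced copy of the question's data. This is only a sketch: it keeps two of the six columns, sets `stringsAsFactors = FALSE` explicitly (an assumption, to avoid version-dependent factor behaviour), and introduces the helper names `keep`, `via_setdiff`, and `via_negation` for illustration.

```r
# Reduced copy of the question's data frame
origdf <- data.frame(apple = c(111, 2, 4, "fox"),
                     orange = c(222, 11, 12, 14),
                     stringsAsFactors = FALSE)
extract <- origdf[origdf$apple %in% c("111", "fox"), ]

# Logical vector: TRUE for rows whose apple value is NOT in extract.
# %in% binds more tightly than !, so this is !(origdf$apple %in% extract$apple)
keep <- !origdf$apple %in% extract$apple

via_setdiff  <- origdf[origdf$apple %in% setdiff(origdf$apple, extract$apple), ]
via_negation <- origdf[keep, ]

# Both subsets contain rows 2 and 3
print(identical(via_setdiff, via_negation))
```

The negated form avoids the extra setdiff call, since %in% already performs the membership test row by row.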

So I both made a silly mistake and picked an unfortunate example! Thanks. — W Barker, Nov 10 '17 at 19:16
@WBarker Glad that my solution helped. If you think that this answers your question, you may accept it by clicking on the grey check mark under the downvote button :) — acylam, Nov 10 '17 at 19:32
@useR It did work. Continuing my research, it appears that the following also works: `require(dplyr); anti_join(origdf, extract)` (not sure how to add code in the comments section). — W Barker, Nov 10 '17 at 22:34
