2

I have a data frame with two columns. The first column defines subsets of the data. I want to find all values in the second column that only appear in one subset in the first column.

For example, from:

df=data.frame(
  data_subsets=rep(LETTERS[1:2],each=5),
  data_values=c(1,2,3,4,5,2,3,4,6,7))

data_subsets data_values
      A           1
      A           2
      A           3
      A           4
      A           5
      B           2
      B           3
      B           4
      B           6
      B           7

I would want to extract the following data frame.

data_subsets   data_values
    A              1
    A              5
    B              6
    B              7

I have been playing around with duplicated but I just can't seem to make it work. Any help is appreciated. There are a number of topics tackling similar problems, I hope I didn't overlook the answer in my searches!

EDIT

I modified the approach from @Matthew Lundberg of counting the number of elements and extracting from the data frame. For some reason his approach was not working with the data frame I had, so I came up with this, which is less elegant but gets the job done:

counts=rowSums(do.call("rbind",tapply(df$data_subsets,df$data_values,FUN=table)))
extract=names(counts)[counts==1]
df[match(extract,df$data_values),]
jslefche
  • 4,379
  • 7
  • 39
  • 50

4 Answers4

5

First, find the count of each element in df$data_values:

 x <- sapply(df$data_values, function(x) sum(as.numeric(df$data_values == x)))

> x
 [1] 1 2 2 2 1 2 2 2 1 1

Now extract the rows:

> df[x==1,]
   data_subsets data_values
1             A           1
5             A           5
9             B           6
10            B           7

Note that you missed "A 5" above. There is no "B 5".

Matthew Lundberg
  • 42,009
  • 6
  • 90
  • 112
  • Great, thanks--edited above to reflect my mistake! When applying to my own data set, I get the error "level sets of factors are different". Will investigate and report back.. – jslefche Jun 03 '12 at 19:37
  • Works for the example data set so upvote, but for some reason, wouldn't work for my real data set. See the edit up top to see my (inelegant) solution – jslefche Jun 03 '12 at 22:16
2

You had the right idea with duplicated. The trick is to combine fromLast = TRUE and fromLast = FALSE options to get a full list of non-duplicated rows.

!duplicated(df$data_values,fromLast = FALSE)&!duplicated(df$data_values,fromLast = TRUE)
 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE

Indexing your data.frame with this vector gives:

df[!duplicated(df$data_values,fromLast = FALSE)&!duplicated(df$data_values,fromLast = TRUE),]
   data_subsets data_values
1             A           1
5             A           5
9             B           6
10            B           7
Pierre Lapointe
  • 16,017
  • 2
  • 43
  • 56
  • Thanks! Seems to work well for the example data set but I had difficulties applying to my own data set. See my edit up top for my (inelegant) solution – jslefche Jun 03 '12 at 22:16
0

A variant of P Lapointe's answer would be

df[! df$data_values %in% df[duplicated( unique(df)$data_values ), ]$data_values,]

The unique() deals with the possibility (not in your test data) that some rows in the data may be identical and you want to keep them once if the same data_values does not appear for distinct data_sets (or distinct other columns).

Henry
  • 6,704
  • 2
  • 23
  • 39
0

You can use the 'dplyr' and 'explore' library to overcome this problem.

library(dplyr)
library(explore)

df=data.frame(
  data_subsets=rep(LETTERS[1:2],each=5),
  data_values=c(1,2,3,4,5,2,3,4,6,7))

df  %>% describe(data_subsets)

######## output ########
#variable = data_subsets
#type     = character
#na       = 0 of 10 (0%)
#unique   = 2
# A       = 5 (50%)
# B       = 5 (50%)