Subset dataframe based on duplicate values in multiple rows

Question

How can I subset the dataframe below to show only the rows were columns AAA:CCC all have the same value and retain the IndID field?

Dat <- data.frame(IndID = LETTERS[seq(1,10)],
                  AAA = c(1,5,3,2,3,1,5,4,6,2),
                  BBB = c(1,8,5,2,5,4,8,4,4,5),
                  CCC = c(1,5,3,2,3,5,5,4,6,5))

> Dat
   IndID AAA BBB CCC
1      A   1   1   1
2      B   5   8   5
3      C   3   5   3
4      D   2   2   2
5      E   3   5   3
6      F   1   4   5
7      G   5   8   5
8      H   4   4   4
9      I   6   4   6
10     J   2   5   5

I would like to return the following result.

Result <- data.frame(IndID = c("A", "D", "H"),
                  AAA = c(1,2,4),
                  BBB = c(1,2,4),
                  CCC = c(1,2,4))

> Result
  IndID AAA BBB CCC
1     A   1   1   1
2     D   2   2   2
3     H   4   4   4

I have found a number of related posts including Find duplicated rows (based on 2 columns) in Data Frame in R and Find duplicated elements with dplyr, among others, but have not been able to reproduce the desired result with three columns. For example, while close the code below shows all distinct rows, but with the undesired result of including rows were only two of the values are equal.

Dat %>% distinct(AAA, BBB, CCC)

I suspect the solution involves filter but an not sure how to obtain the desired result from the example mentioned above. A dplyr solution is preferred.

Addition I also wonder if this could be applied to multiple factors. For example with the data below, which contains an IndID and three dates stored as factors, can a similar result be produced by either allowing factors or changing the factors to numeric values?

Dat <- structure(list(GenIndID = structure(c(1L, 2L, 6L, 7L, 3L, 4L, 
8L, 5L), .Label = c("BHS_601", "BHS_603", "BHS_604", "BHS_605", 
"BHS_631", "BHS_635", "BHS_636", "BHS_637"), class = "factor"), 
    CptrDate = structure(c(1L, 2L, 3L, 3L, 2L, 2L, 3L, 4L), .Label = c("2016-02-01", 
    "2016-02-02", "2016-02-04", "2016-12-11"), class = "factor"), 
    DtLastAlive = structure(c(2L, 2L, 1L, 1L, 2L, 2L, 1L, 3L), .Label = c("2016-02-04", 
    "2017-07-13", "2017-08-27"), class = "factor"), DtFnlFate = structure(c(2L, 
    2L, 1L, 1L, 2L, 2L, 1L, 3L), .Label = c("2016-02-04", "2017-07-13", 
    "2017-08-27"), class = "factor")), .Names = c("GenIndID", 
"CptrDate", "DtLastAlive", "DtFnlFate"), row.names = c(82L, 83L, 
224L, 225L, 84L, 85L, 226L, 360L), class = "data.frame")


> Dat
    GenIndID   CptrDate DtLastAlive  DtFnlFate
82   BHS_601 2016-02-01  2017-07-13 2017-07-13
83   BHS_603 2016-02-02  2017-07-13 2017-07-13
224  BHS_635 2016-02-04  2016-02-04 2016-02-04
225  BHS_636 2016-02-04  2016-02-04 2016-02-04
84   BHS_604 2016-02-02  2017-07-13 2017-07-13
85   BHS_605 2016-02-02  2017-07-13 2017-07-13
226  BHS_637 2016-02-04  2016-02-04 2016-02-04
360  BHS_631 2016-12-11  2017-08-27 2017-08-27

with the desired result being

> Dat[c(3, 4, 7),]
    GenIndID   CptrDate DtLastAlive  DtFnlFate
224  BHS_635 2016-02-04  2016-02-04 2016-02-04
225  BHS_636 2016-02-04  2016-02-04 2016-02-04
226  BHS_637 2016-02-04  2016-02-04 2016-02-04

`Dat[apply(Dat[-1], 1, function(x) length(unique(x)) == 1),]` — d.b, Nov 15 '17 at 16:15

acylam · Accepted Answer · 2017-11-15T17:37:52.273

6

Here's another solution with apply and all:

Dat[apply(Dat[,-1], 1, function(x) all(x==x[1])),]

or with filter_at from dplyr:

library(dplyr)
Dat %>% 
  filter_at(vars(AAA:CCC), all_vars(. == .data$AAA))

Result:

  IndID AAA BBB CCC
1     A   1   1   1
4     D   2   2   2
8     H   4   4   4

Edit:

In response to OP's additional example, the apply example would work regardless of variable type. So the following would work for the new example:

Dat[apply(Dat[,-1], 1, function(x) all(x==x[1])),]

Result:

    GenIndID   CptrDate DtLastAlive  DtFnlFate
224  BHS_635 2016-02-04  2016-02-04 2016-02-04
225  BHS_636 2016-02-04  2016-02-04 2016-02-04
226  BHS_637 2016-02-04  2016-02-04 2016-02-04

For filter_at, however, if the columns to compare are factor's, one would need to first convert to character:

Dat %>% 
  filter_at(vars(-1), all_vars(as.character(.) == .data$CptrDate))

Notice that you only need to convert . to character, but not .data$CptrDate because factors can be compered with characters, but not with another factor with different levels.

An alternative would be:

Dat %>%
  mutate_at(vars(-1), as.character) %>%
  filter_at(vars(-1), all_vars(. == .data$CptrDate))

Result:

  GenIndID   CptrDate DtLastAlive  DtFnlFate
1  BHS_635 2016-02-04  2016-02-04 2016-02-04
2  BHS_636 2016-02-04  2016-02-04 2016-02-04
3  BHS_637 2016-02-04  2016-02-04 2016-02-04

edited Nov 15 '17 at 17:37

answered Nov 15 '17 at 16:34

acylam

18,231
5
36
45

is `filter_at` a new addition to `dplyr`? – B. Davis Nov 15 '17 at 16:43
@B.Davis I think it's been a while since release. What version of `dplyr` do you have? Check `sessionInfo()` – acylam Nov 15 '17 at 16:47
yea. I had 0.5.0 but just reinstalled and now have `filter_at` – B. Davis Nov 15 '17 at 16:48
@B.Davis Great. There is also `filter_all`, `summarize_at`, `summarize_all` etc. So for example if you don't have the `IndID` column, I would use `filter_all` – acylam Nov 15 '17 at 16:49
This works for my numeric data! perfect! I am also wondering if it can be applied to factors and have added an addition above. Thanks in advance! – B. Davis Nov 15 '17 at 17:15
@B.Davis For `apply`, no need to convert. For `filter_at`, you will need to first convert to character. See my edits. – acylam Nov 15 '17 at 17:34

score 2 · Answer 2 · answered Nov 15 '17 at 16:53

2

If all you need is a subset of 3 columns you can use the DF[,] operator.

# DF[where rows have a value, select columns]
# [where rows where AAA==BBB==CCC, select all columns with ""]
temp <- Dat[Dat$AAA == Dat$BBB & Dat$BBB == Dat$CCC,]

answered Nov 15 '17 at 16:53

Adam Sampson

1,971
1
7
15

KenHBS · Answer 3 · 2017-11-15T16:31:36.853

You could use a combination of range() and diff().

Dat[apply(Dat[ ,-1], 1, function(x) diff(range(x)))==0, ] 
#   IndID AAA BBB CCC
# 1     A   1   1   1
# 4     D   2   2   2
# 8     H   4   4   4

range() gives you the minimum and the maximum of a vector. diff() gives you the difference between values in a vector. If the difference between min-value and max-value is zero, you know all values are equal. If we apply this row-wise using apply, we get a TRUE/FALSE vector that we can use to index Dat.

Small examples to illustrate the logic:

test <- c(1, 5, 3)
minmax <- range(test)    # gives c(1,5)
diff(minmax)             # gives 4
diff(range(c(1, 1, 1)))  # gives 0

If we check for every row, whether or not diff(range(your_row)) equals zero, we can use the output of that to index Dat, like we did above.

score 1 · Answer 4 · answered Nov 15 '17 at 23:50

1

You can have a vectorized solution:

Dat[do.call(function(...) pmax(...) - pmin(...), Dat[, -1]) == 0,]
#  IndID AAA BBB CCC
#1     A   1   1   1
#4     D   2   2   2
#8     H   4   4   4

answered Nov 15 '17 at 23:50

eddi

49,088
6
104
155

Subset dataframe based on duplicate values in multiple rows

4 Answers4