ID     Cat1  Cat2    Cat3   Cat4
A0001   358 11.25   37428   0
A0001   279 14.6875 38605   0
A0013   367 5.125   40152   1
A0014   337 16.3125 38624   0
A0020   367 8.875   37797   0
A0020   339 9.625   39324   0

I need help learning how to remove the unique rows in my file while keeping the duplicates or triplicates. For example, the output should look like this:

ID     Cat1  Cat2    Cat3   Cat4
A0001   358 11.25   37428   0
A0001   279 14.6875 38605   0
A0020   367 8.875   37797   0
A0020   339 9.625   39324   0

Any advice on how to approach this problem would be much appreciated.

Edit: Thanks for everyone's suggestions. As a follow-up, I want to calculate the difference in values in the different categories (i.e. Cat2, Cat3) between the repeated measures (by unique ID). I would appreciate any suggestions; a sketch of one approach is below.
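
A minimal sketch of this follow-up with data.table, assuming the differences wanted are between successive rows within each ID (IDs with a single row drop out, since diff returns a length-zero result for them):

library(data.table)

# reconstruct the repeated-measures rows from the example above
dx <- read.table(text = "
ID     Cat1  Cat2    Cat3   Cat4
A0001   358 11.25   37428   0
A0001   279 14.6875 38605   0
A0020   367 8.875   37797   0
A0020   339 9.625   39324   0", header = TRUE)

# successive within-ID differences for the chosen categories
setDT(dx)[, lapply(.SD, diff), by = ID, .SDcols = c("Cat2", "Cat3")]

#        ID   Cat2 Cat3
# 1: A0001 3.4375 1177
# 2: A0020 0.7500 1527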

sleepyjoe

3 Answers


Another option in base R, using duplicated:

# keep rows whose ID appears among the duplicated IDs
dx[dx$ID %in% dx$ID[duplicated(dx$ID)], ]

#      ID Cat1    Cat2  Cat3 Cat4
# 1 A0001  358 11.2500 37428    0
# 2 A0001  279 14.6875 38605    0
# 5 A0020  367  8.8750 37797    0
# 6 A0020  339  9.6250 39324    0

data.table, using duplicated

Using duplicated together with its fromLast version, you get:

library(data.table)
setkey(setDT(dx), ID) # or with data.table 1.9.5+: setDT(dx, key = "ID")
# compare on ID explicitly so the result doesn't depend on the key default
dx[duplicated(dx, by = "ID") | duplicated(dx, by = "ID", fromLast = TRUE)]

#       ID Cat1    Cat2  Cat3 Cat4
# 1: A0001  358 11.2500 37428    0
# 2: A0001  279 14.6875 38605    0
# 3: A0020  367  8.8750 37797    0
# 4: A0020  339  9.6250 39324    0

The same idea can be applied in base R as well (a sketch follows), but I prefer data.table here for the syntactic sugar.
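
For reference, a minimal base R sketch of the same idea, applied directly to the ID column of a plain data.frame:

# a row is kept when its ID is duplicated looking forward or backward,
# so every member of a repeated group survives
dx[duplicated(dx$ID) | duplicated(dx$ID, fromLast = TRUE), ]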

Frank
agstudy
    This `base` solution is much faster than the solutions in the other answer (didn't try any `data.table` solutions). 6x faster than `ave` and 14x faster than `dplyr`. – hmhensen Dec 29 '18 at 01:59

General comments.

  • The ave approach is the only one here that preserves the data's initial row ordering.
  • The by approach should be very slow. I suspect that data.table and dplyr are not (yet) much faster than ave and tapply at selecting groups. Benchmarks to prove me wrong are welcome; a sketch for running them follows this list.
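
A minimal benchmarking sketch, assuming the microbenchmark package is available. The tiny table below is just the question's example, so at this size the timings mostly measure per-call overhead:

library(microbenchmark)
library(data.table)
library(dplyr)

df <- read.table(text = "
ID     Cat1  Cat2    Cat3   Cat4
A0001   358 11.25   37428   0
A0001   279 14.6875 38605   0
A0013   367 5.125   40152   1
A0014   337 16.3125 38624   0
A0020   367 8.875   37797   0
A0020   339 9.625   39324   0", header = TRUE)
dt <- as.data.table(df)

microbenchmark(
  ave        = df[ave(1:nrow(df), df$ID, FUN = length) > 1, ],
  data.table = dt[, if (.N > 1) .SD, by = ID],
  dplyr      = df %>% group_by(ID) %>% filter(n() > 1),
  times      = 100
)

Rerun this on data of realistic size (and with many groups) before drawing conclusions.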

base R (Thanks to @thelatemail for the first two approaches.)

1) Each row is assigned the length of its df$ID group, and we filter based on the vector of lengths.

df[ ave(1:nrow(df), df$ID, FUN=length) > 1 , ]

2) Alternately, we split row names or numbers by df$ID, selecting which groups' rows to keep. tapply returns a list of groups of rows, so we must unlist them into a single vector of rows.

df[ unlist(tapply(1:nrow(df), df$ID, function(x) if (length(x) > 1) x)) , ]

What follows is a worse approach, but better parallels what you see with data.table and dplyr:

3) The data is split by df$ID, keeping each subset of data, SD, if it has more than one row. by returns a list, so we must rbind the pieces back together.

do.call( rbind, c(list(make.row.names = FALSE),
    by(df, df$ID, FUN=function(SD) if (nrow(SD) > 1) SD )))

data.table .N corresponds to nrow within a by=ID group, and .SD is the subset of data.

library(data.table)
setDT(df)[, if (.N>1) .SD, by=ID]

#       ID Cat1    Cat2  Cat3 Cat4
# 1: A0001  358 11.2500 37428    0
# 2: A0001  279 14.6875 38605    0
# 3: A0020  367  8.8750 37797    0
# 4: A0020  339  9.6250 39324    0

dplyr n() corresponds to nrow within a group_by(ID) group.

library(dplyr)
df %>% group_by(ID) %>% filter( n() > 1 )

# Source: local data frame [4 x 5]
# Groups: ID
# 
#      ID Cat1    Cat2  Cat3 Cat4
# 1 A0001  358 11.2500 37428    0
# 2 A0001  279 14.6875 38605    0
# 3 A0020  367  8.8750 37797    0
# 4 A0020  339  9.6250 39324    0
Frank
    Avoid all that splitting and row-binding, use `ave` - `dat[with(dat, ave(seq_along(ID),ID,FUN=length) )>1,]` – thelatemail Aug 27 '15 at 22:58
  • @thelatemail Thanks, that is better. I've edited it in with top billing. [I changed it slightly so that when I refer to how `ID` is used, it's clear which part of the code I'm referring to (the second arg of `ave`.)] – Frank Aug 27 '15 at 23:12
    @Frank - you could also make `by` / `tapply` logic work by subsetting by `rownames` - `dat[unlist(tapply(rownames(dat), dat$ID, FUN=function(x) if(length(x)>1) x)),]` – thelatemail Aug 28 '15 at 00:36
  • @thelatemail Thanks again. I've added it and rearranged to fit in some commentary at the top. – Frank Aug 28 '15 at 01:57
    If I want to make a twist and just take the first row of each unique ID, any suggestions? – sleepyjoe Aug 31 '15 at 17:31
  • @sleepyjoe You mean `df %>% distinct(ID)` or with data.table `unique(df,by="ID")` – Frank Aug 31 '15 at 17:52

I know this is an old question, but I was just having the same problem and found this solution the easiest:

# duplicated(data$ID) alone flags only the second and later occurrences;
# the fromLast pass also keeps the first row of each repeated group,
# and the trailing comma is needed when data is a data.frame
data <- data[duplicated(data$ID) | duplicated(data$ID, fromLast = TRUE), ]

Eli