1

I have a dataframe like:

ID        DATE  N  Price
 1  2013-02-04  3  29.99
 1  2013-03-18  1   9.99
 1  2013-04-13  2  19.99
 2  2013-02-18  1  18.99
 2  2013-05-11  2  19.99

This answer Extract rows for the first occurrence of a variable in a data frame tells how to extract the FIRST occurrences of a target value ... but I need ALL BUT the first occurrence, aka :

ID        DATE  N  Price
 1  2013-03-18  1   9.99
 1  2013-04-13  2  19.99
 2  2013-05-11  2  19.99

What is the recommended approach to this?

My initial instinct was to use the approach outlined in the linked answer, build a "first" subset, and then say, "From the original data frame, give me everything except these values" ... but that seemed perhaps more complicated.

Community
  • 1
  • 1
Dan
  • 4,197
  • 6
  • 34
  • 52

3 Answers3

6

You can do this with the data.table package quite easily.

library(data.table)
setDT(df)[, .SD[-1], by = ID]
#    ID       DATE N Price
# 1:  1 2013-03-18 1  9.99
# 2:  1 2013-04-13 2 19.99
# 3:  2 2013-05-11 2 19.99

where df is your original data. This removes the first row for each group, grouped by ID.

Another option is the dplyr package.

library(dplyr)
slice(group_by(df, ID), -1)
#      ID       DATE     N Price
#   (int)     (fctr) (int) (dbl)
# 1     1 2013-03-18     1  9.99
# 2     1 2013-04-13     2 19.99
# 3     2 2013-05-11     2 19.99

These remove the first row for all groups. You don't specify what should happen if there is only one row for a group. Should you need to keep those rows, you would need to account for that. So let's add a single row as a group and take a look.

dff <- rbind(df, df[4, ])
dff[6, 1] <- 3

Then the data.table code would be

setDT(dff)[, .SD[if(.N == 1L) 1 else -1], by = ID]
#    ID       DATE N Price
# 1:  1 2013-03-18 1  9.99
# 2:  1 2013-04-13 2 19.99
# 3:  2 2013-05-11 2 19.99
# 4:  3 2013-02-18 1 18.99

and the dplyr code would be

slice(group_by(dff, ID), if(n() == 1L) 1 else -1)
#      ID       DATE     N Price
#   (dbl)     (fctr) (int) (dbl)
# 1     1 2013-03-18     1  9.99
# 2     1 2013-04-13     2 19.99
# 3     2 2013-05-11     2 19.99
# 4     3 2013-02-18     1 18.99

for those situations.

Data:

df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), DATE = structure(c(1L, 
3L, 4L, 2L, 5L), .Label = c("2013-02-04", "2013-02-18", "2013-03-18", 
"2013-04-13", "2013-05-11"), class = "factor"), N = c(3L, 1L, 
2L, 1L, 2L), Price = c(29.99, 9.99, 19.99, 18.99, 19.99)), .Names = c("ID", 
"DATE", "N", "Price"), class = "data.frame", row.names = c(NA, 
-5L))
Rich Scriven
  • 97,041
  • 11
  • 181
  • 245
5

If you don't want to use add-on packages:

df[duplicated(df$ID),]
Ott Toomet
  • 1,894
  • 15
  • 25
  • I like this for its simplicity. The best I came up with was `df[-match(unique(df$ID), df$ID),]` , but this is even better. – thelatemail Nov 04 '15 at 02:46
1

Another option is ave

 df[with(df, ave(ID, ID, FUN= seq_along)!=1),]
 #  ID       DATE N Price
 #2  1 2013-03-18 1  9.99
 #3  1 2013-04-13 2 19.99
 #5  2 2013-05-11 2 19.99
akrun
  • 874,273
  • 37
  • 540
  • 662