
I'm trying to collapse a data frame by removing all but one row from each group of rows with identical values in a particular column. In other words, the first row from each group.

For example, I'd like to convert this

> d = data.frame(x=c(1,1,2,4),y=c(10,11,12,13),z=c(20,19,18,17))
> d
  x  y  z
1 1 10 20
2 1 11 19
3 2 12 18
4 4 13 17

Into this:

    x  y  z
1   1 11 19
2   2 12 18
3   4 13 17

I'm using aggregate to do this currently, but the performance is unacceptable with more data:

> d.ordered = d[order(-d$y),]
> aggregate(d.ordered,by=list(key=d.ordered$x),FUN=function(x){x[1]})

I've tried split/unsplit with the same function argument as here, but unsplit complains about duplicate row numbers.

Is rle a possibility? Is there an R idiom to convert rle's length vector into the indices of the rows that start each run, which I can then use to pluck those rows out of the data frame?
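For what it's worth, here is a sketch of the rle idea I had in mind (assuming the data frame is first sorted by x, then by descending y so the desired row leads each run): cumsum over the shifted run lengths gives the index where each run starts.

```r
d <- data.frame(x = c(1, 1, 2, 4), y = c(10, 11, 12, 13), z = c(20, 19, 18, 17))

# Sort so the row I want from each group comes first within its run
d.ordered <- d[order(d$x, -d$y), ]

# rle gives run lengths; cumsum of the shifted lengths gives run-start indices
r <- rle(d.ordered$x)
starts <- cumsum(c(1, head(r$lengths, -1)))

d.ordered[starts, ]   # one row per distinct x
```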

jogo
jkebinger

4 Answers


Maybe duplicated() can help:

R> d[ !duplicated(d$x), ]
  x  y  z
1 1 10 20
3 2 12 18
4 4 13 17
R> 

Edit: Shucks, never mind. This picks the first row in each block of repetitions; you wanted the last. So here is another attempt using plyr:

R> ddply(d, "x", function(z) tail(z,1))
  x  y  z
1 1 11 19
2 2 12 18
3 4 13 17
R> 

Here plyr does the hard work of finding unique subsets, looping over them and applying the supplied function -- which simply returns the last set of observations in a block z using tail(z, 1).

Dirk Eddelbuettel
  • So then you need to simply add a 'processing step' to create a factor variable over which plyr can loop. It can all be done with indexing commands, give it a try. And by the way, you are inconsistent between your text (saying first row selected) and example (showing second row). – Dirk Eddelbuettel Apr 13 '10 at 02:51
  • By the way, cross-posting between r-help and here is also somewhat poor style. You got good answers at r-help, so why don't you study them? – Dirk Eddelbuettel Apr 13 '10 at 02:59
  • My pleasure. As a matter of common best practices here on StackOverflow, you should accept one post as the solutions (if you feel it provides one) and vote each helpful post up by clicking on the up arrow. That is how the scoring works here. – Dirk Eddelbuettel Apr 13 '10 at 13:36

Here is a data.table solution that will be time- and memory-efficient for large data sets:

library(data.table)
DT <- as.data.table(d)            # convert to data.table
setkey(DT, x)                     # set key to allow binary search using `J()`
DT[J(unique(x)), mult = 'last']   # subset out the last row for each x
DT[J(unique(x)), mult = 'first']  # if you wanted the first row for each x
Matt Dowle
mnel
  • But if _all_ that is needed is the last row in each group, then `DT[!duplicated(x,fromLast=TRUE)]` is likely faster than the total time of `setkey` + join, and with some syntactic sugar advantage of avoiding variable name repetition of `DT` (i.e. just `x` not `DT$x`). – Matt Dowle Sep 19 '12 at 08:43
  • Using the row index would speed things up I guess: `DT[DT[, .I[.N], by = x]$V1]`. Check http://stackoverflow.com/questions/19424762/efficiently-selecting-top-number-of-rows-for-each-unique-value-of-a-column-in-a . Thanks to @Simono101 – Freddy Nov 20 '13 at 10:36
  • `unique(DT,by="x",fromLast=TRUE)` is now simpler and faster than `DT[!duplicated(x,fromLast=TRUE)]` and `DT[J(unique(x)), mult ='last']` – Matthew Sep 02 '14 at 01:50

Just to add a little to what Dirk provided... duplicated has a fromLast argument that you can use to select the last row:

d[ !duplicated(d$x,fromLast=TRUE), ]
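Applied to the example data from the question, this keeps the last row of each block of repeated x values, which is exactly the desired result:

```r
d <- data.frame(x = c(1, 1, 2, 4), y = c(10, 11, 12, 13), z = c(20, 19, 18, 17))

# keep the last row within each block of repeated x values
d[!duplicated(d$x, fromLast = TRUE), ]
#   x  y  z
# 2 1 11 19
# 3 2 12 18
# 4 4 13 17
```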
Ian Fellows
  • Hi Ian -- unfortunately James never really made a clear case as to whether he wanted first or last and contradicts himself in the post ... but your hint about fromLast is a good one! – Dirk Eddelbuettel Apr 13 '10 at 12:14
  • thanks, that works like a charm. Whether it's first or last I needed was really up to the ordering, and with fromLast I can attack it either way – jkebinger Apr 13 '10 at 12:54
  • I suggested the same thing and you shot it down on on the grounds of 'prefer all columns'. How come that no longer matters? – Dirk Eddelbuettel Apr 13 '10 at 13:35
  • Sorry, Dirk, I misunderstood how duplicated works at the time – jkebinger Apr 15 '10 at 15:17

There are a couple of options using dplyr:

library(dplyr)
df %>% distinct(x, .keep_all = TRUE)
df %>% group_by(x) %>% filter(row_number() == 1)
df %>% group_by(x) %>% slice(1)

You can use more than one column with both distinct() and group_by():

df %>% distinct(x, y, .keep_all = TRUE)

The group_by() and filter() approach can be useful if there is a date or some other sequential field and you want to ensure the most recent observation is kept, and slice() is useful if you want to avoid ties:

df %>% group_by(x) %>% filter(date == max(date)) %>% slice(1)
sbha
  • dplyr::top_n() is also an option; see https://stackoverflow.com/questions/13279582/select-the-first-row-by-group/50955051#50955051 – Kresten Apr 08 '19 at 08:14