Update in light of OP's comment
If you are doing this on a million+ rows, all of the options supplied so far will be slow. Here are some comparison timings on a dummy data set of 100,000 rows:
set.seed(12)
DF3 <- data.frame(id = sample(1000, 100000, replace = TRUE),
                  group = factor(rep(1:100, each = 1000)),
                  value = runif(100000))
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))
> system.time(out1 <- do.call(rbind, lapply(split(DF3, DF3["group"]), `[`, 1, )))
   user  system elapsed
 19.594   0.053  19.984
> system.time(out3 <- aggregate(DF3[,-2], DF3["group"], function (x) x[1]))
   user  system elapsed
 12.419   0.141  12.788
I gave up doing them with a million rows. Far faster, believe it or not, is to build the result as a matrix and convert back to a data frame (unlist() coerces everything to a common type, hence the factor columns being recreated at the end):
out2 <- matrix(unlist(lapply(split(DF3[, -4], DF3["group"]), `[`, 1, )),
               byrow = TRUE, nrow = (lev <- length(levels(DF3$group))))
colnames(out2) <- names(DF3)[-4]
rownames(out2) <- seq_len(lev)
out2 <- as.data.frame(out2)
out2$group <- factor(out2$group)
out2$idu <- factor(paste(out2$id, out2$group, sep = "_"),
                   levels = levels(DF3$idu))
The outputs are (effectively) the same:
> all.equal(out1, out2)
[1] TRUE
> all.equal(out1, out3[, c(2,1,3,4)])
[1] "Attributes: < Component 2: Modes: character, numeric >"
[2] "Attributes: < Component 2: target is character, current is numeric >"
(The difference between `out1` (or `out2`) and `out3` (the `aggregate()` version) is just in the row names of the components.)
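A quick way to confirm that (my own check, not part of the timings above) is to copy the row names across and compare again; `all.equal()` should then report `TRUE` if the row names were the only difference:

rownames(out3) <- rownames(out1)   ## give out3 the character row names out1 has
all.equal(out1, out3[, c(2, 1, 3, 4)])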
The matrix version achieves all this with a timing of:
   user  system elapsed
  0.163   0.001   0.168
on the 100,000 row problem, and on this million row problem:
set.seed(12)
DF3 <- data.frame(id = sample(1000, 1000000, replace = TRUE),
                  group = factor(rep(1:1000, each = 1000)),
                  value = runif(1000000))
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))
with a timing of:
   user  system elapsed
 11.916   0.000  11.925
Working with the matrix version (the one that produces `out2`) is quicker doing the million rows than the other versions are at doing the 100,000-row problem. This just shows that working with matrices is very quick indeed, and the bottleneck in my `do.call()` version is `rbind()`-ing the result together.
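As an aside (not one of the options timed above), if all you want is the first row for each group in order of first occurrence, a vectorised subset with `duplicated()` avoids the split/`rbind()` machinery altogether; a minimal sketch:

## !duplicated() is TRUE only for the first row of each group
out5 <- DF3[!duplicated(DF3$group), ]

The row names (and the row order, if the groups are interleaved) will differ from `out1`/`out2`, but the rows selected are the same.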
The million row problem timing was done with:
system.time({
    out4 <- matrix(unlist(lapply(split(DF3[, -4], DF3["group"]), `[`, 1, )),
                   byrow = TRUE,
                   nrow = (lev <- length(levels(DF3$group))))
    colnames(out4) <- names(DF3)[-4]
    rownames(out4) <- seq_len(lev)
    out4 <- as.data.frame(out4)
    out4$group <- factor(out4$group)
    out4$idu <- factor(paste(out4$id, out4$group, sep = "_"),
                       levels = levels(DF3$idu))
})
Original
If your data are in `DF`, say, then:
do.call(rbind, lapply(with(DF, split(DF, group)), head, 1))
will do what you want:
> do.call(rbind, lapply(with(DF, split(DF, group)), head, 1))
  idu group
1   1     1
2   4     2
3   7     3
If the new data are in `DF2`, then we get:
> do.call(rbind, lapply(with(DF2, split(DF2, group)), head, 1))
  id group idu value
1  1     1 1_1    34
2  4     2 4_2     6
3  1     3 1_3    34
But for speed, we probably want to subset instead of using `head()`, and we can gain a bit by not using `with()`, e.g.:
do.call(rbind, lapply(split(DF2, DF2$group), `[`, 1, ))
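In case the subsetting call looks opaque: lapply() hands each group's data frame to `[` as its first argument, and the trailing empty argument is the missing column index, so each call amounts to x[1, ]. A one-line illustration (mine, not from the timings):

identical(`[`(DF2, 1, ), DF2[1, ])   ## should be TRUE: `[`(x, 1, ) is x[1, ]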
> system.time(replicate(1000, do.call(rbind, lapply(split(DF2, DF2$group), `[`, 1, ))))
   user  system elapsed
  3.847   0.040   4.044
> system.time(replicate(1000, do.call(rbind, lapply(split(DF2, DF2$group), head, 1))))
   user  system elapsed
  4.058   0.038   4.111
> system.time(replicate(1000, aggregate(DF2[,-2], DF2["group"], function (x) x[1])))
   user  system elapsed
  3.902   0.042   4.106