119

From a dataframe like this

test <- data.frame('id'= rep(1:5,2), 'string'= LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

> test
    id string
 1   1      A
 2   1      F
 3   2      B
 4   2      G
 5   3      C
 6   3      H
 7   4      D
 8   4      I
 9   5      E
 10  5      J

I want to create a new one with the first row of each id / string pair. If sqldf accepted R code within it, the query could look like this:

res <- sqldf("select id, min(rownames(test)), string 
              from test 
              group by id, string")

> res
    id string
 1   1      A
 3   2      B
 5   3      C
 7   4      D
 9   5      E

Is there a solution short of creating a new column like

test$row <- rownames(test)

and running the same sqldf query with min(row)?

Henrik
dmvianna

8 Answers

139

You can use duplicated to do this very quickly.

test[!duplicated(test$id),]
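As the comments below point out, duplicated() flags the first occurrence of each id regardless of row order, so the same one-liner works even if the data frame has not been sorted by id (a quick sketch, using a copy of the question's unsorted data):

```r
# duplicated() marks every repeat of an id; negating it keeps only
# the first occurrence of each id, in whatever order the rows appear
test2 <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])  # not sorted by id
test2[!duplicated(test2$id), ]
#   id string
# 1  1      A
# 2  2      B
# 3  3      C
# 4  4      D
# 5  5      E
```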

Benchmarks, for the speed freaks:

ju <- function() test[!duplicated(test$id),]
gs1 <- function() do.call(rbind, lapply(split(test, test$id), head, 1))
gs2 <- function() do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
jply <- function() ddply(test,.(id),function(x) head(x,1))
jdt <- function() {
  testd <- as.data.table(test)
  setkey(testd,id)
  # Initial solution (slow)
  # testd[,lapply(.SD,function(x) head(x,1)),by = key(testd)]
  # Faster options :
  testd[!duplicated(id)]               # (1)
  # testd[, .SD[1L], by=key(testd)]    # (2)
  # testd[J(unique(id)),mult="first"]  # (3)
  # testd[ testd[,.I[1L],by=id] ]      # (4) needs v1.8.3. Allows 2nd, 3rd etc
}

library(plyr)
library(data.table)
library(rbenchmark)

# sample data
set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]

benchmark(ju(), gs1(), gs2(), jply(), jdt(),
    replications=5, order="relative")[,1:6]
#     test replications elapsed relative user.self sys.self
# 1   ju()            5    0.03    1.000      0.03     0.00
# 5  jdt()            5    0.03    1.000      0.03     0.00
# 3  gs2()            5    3.49  116.333      2.87     0.58
# 2  gs1()            5    3.58  119.333      3.00     0.58
# 4 jply()            5    3.69  123.000      3.11     0.51

Let's try that again, but with just the contenders from the first heat and with more data and more replications.

set.seed(21)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
benchmark(ju(), jdt(), order="relative")[,1:6]
#    test replications elapsed relative user.self sys.self
# 1  ju()          100    5.48    1.000      4.44     1.00
# 2 jdt()          100    6.92    1.263      5.70     1.15
Joshua Ulrich
  • The winner: `system.time(dat3[!duplicated(dat3$id),])` → user 0.07, system 0.00, elapsed 0.07 – dmvianna Nov 07 '12 at 23:18
  • @dmvianna: I don't have it installed and didn't feel like bothering with it. :) – Joshua Ulrich Nov 07 '12 at 23:33
  • Are we sure that my data.table code is as efficient as possible? I'm not confident in my ability to eke the best performance out of that tool. – joran Nov 07 '12 at 23:36
  • @joran Geez S.O. moves quickly. That's pretty inefficient. Without a key just `DT[,.SD[1L],by=id]`. Or `setkey` and the same, or `DT[J(unique(id)),mult="first"]`. But `DT[!duplicated(id)]` is probably fastest for this case, just like `ju()`. But `lapply`'ing an S3 generic (`head`), by group, is going to be _slow_. – Matt Dowle Nov 08 '12 at 00:14
  • Also, I reckon, if you are going to benchmark the data.table keying, you should include the ordering by id within the base calls. – mnel Nov 08 '12 at 00:42
  • @mnel Oh yes, I missed that! – Matt Dowle Nov 08 '12 at 00:44
  • @MatthewDowle, I've benchmarked the `mult = 'first'` solution without the key / order in both in my answer – mnel Nov 08 '12 at 00:47
  • @MatthewDowle: please don't edit your questions into answers, just correct the errors or comment and wait for a response. – Joshua Ulrich Nov 08 '12 at 02:23
  • @mnel: I guess I assumed that setting the key on a column that was already ordered would be near-costless. – Joshua Ulrich Nov 08 '12 at 02:55
  • @JoshuaUlrich One more question: why is the first sentence needed, i.e. the assumption that data is already sorted? `!duplicated(x)` finds the first of each group even if it isn't sorted, iiuc. – Matt Dowle Nov 08 '12 at 09:24
  • @MatthewDowle: You're very correct, it isn't needed. Great catch! – Joshua Ulrich Nov 08 '12 at 15:04
  • Hey Josh, you should really add a README.md to your Github repos, you do some really cool stuff!! Btw, your speed test is very helpful (and instructive). R is moving more and more towards big data... heard of SparkR? – warship Jan 28 '16 at 06:25
86

I favor the dplyr approach.

group_by(id) followed by either

  • filter(row_number()==1) or
  • slice(1) or
  • slice_head(n = 1) # (dplyr >= 1.0)
  • top_n(n = -1)
    • top_n() internally uses the rank function; a negative n selects from the bottom of the ranking.

In some instances, arranging by id after the group_by() can be necessary.

library(dplyr)

# using filter(), top_n() or slice()

m1 <-
test %>% 
  group_by(id) %>% 
  filter(row_number()==1)

m2 <-
test %>% 
  group_by(id) %>% 
  slice(1)

m3 <-
test %>% 
  group_by(id) %>% 
  top_n(n = -1)
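The slice_head() spelling (dplyr >= 1.0) is equivalent to the three above; note that the argument must be named:

```r
# slice_head() keeps the first n rows of each group
m4 <-
test %>% 
  group_by(id) %>% 
  slice_head(n = 1)
```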

All of these methods return the same result:

# A tibble: 5 x 2
# Groups:   id [5]
     id string
  <int> <fct> 
1     1 A     
2     2 B     
3     3 C     
4     4 D     
5     5 E
Kresten
  • Worth giving a shout-out to `slice` as well. `slice(x)` is a shortcut for `filter(row_number() %in% x)`. – Gregor Thomas Jun 20 '18 at 18:53
  • Very elegant. Do you know why I have to convert my `data.table` to a `data.frame` for this to work? – James Hirschorn Jan 13 '19 at 03:09
  • @JamesHirschorn I'm not an expert on all the differences. But `data.table` inherits from `data.frame`, so in many cases you can use dplyr commands on a `data.table`. The example above also works if `test` is a `data.table`. See e.g. https://stackoverflow.com/questions/13618488/what-you-can-do-with-a-data-frame-that-you-cant-with-a-data-table for a deeper explanation – Kresten Jan 14 '19 at 09:06
  • This is a tidyverse way to do it, and as you can see the data.frame is actually a tibble here. I personally advise you to always work with tibbles, also because ggplot2 is built in a similar manner. – Garini Jan 15 '20 at 10:28
19

What about

DT <- data.table(test)
setkey(DT, id)

DT[J(unique(id)), mult = "first"]

Edit

There is also a unique method for data.tables which will return the first row by key:

jdtu <- function() unique(DT)
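On data.table >= 1.9.8 the default by for unique() changed to all columns (see the comments below), so the keyed version above would be spelled out explicitly as:

```r
# unique() with by = "id" keeps the first row per id,
# matching the old by = key(x) default
jdtu <- function() unique(DT, by = "id")
```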

I think, if you are ordering test outside the benchmark, then you can remove the setkey and data.table conversion from the benchmark as well (setkey basically sorts by id, the same as order).

set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]
DT <- data.table(test, key = 'id')
ju <- function() test[!duplicated(test$id),]

jdt <- function() DT[J(unique(id)),mult = 'first']


library(rbenchmark)
benchmark(ju(), jdt(), replications = 5)
##    test replications elapsed relative user.self sys.self 
## 2 jdt()            5    0.01        1      0.02        0        
## 1  ju()            5    0.05        5      0.05        0         

and with more data

Edit with unique method

set.seed(21)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
DT <- data.table(test, key = 'id')
benchmark(ju(), jdt(), jdtu(), replications = 5)
##       test replications elapsed relative user.self sys.self
## 2  jdt()            5    0.09     2.25      0.09     0.00
## 3 jdtu()            5    0.04     1.00      0.05     0.00
## 1   ju()            5    0.22     5.50      0.19     0.03

The unique method is fastest here.

mnel
  • You don't even have to set the key. `unique(DT, by="id")` works directly – Matthew Sep 02 '14 at 01:43
  • FYI as of `data.table` version >= 1.9.8, the default `by` argument for `unique` is `by = seq_along(x)` (all columns), instead of the previous default `by = key(x)` – IceCreamToucan Dec 31 '18 at 21:03
16

If speed is an issue, a similar approach could be taken with data.table:

testd <- data.table(test)
testd[, .SD[1], by = id]

or this might be considerably faster:

testd[testd[, .I[1], by = id]$V1]
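The inner call computes, for each id, the global row number of that group's first row; `$V1` extracts those indices and the outer `[` subsets by them. A sketch of the intermediate result, using the question's sorted sample data:

```r
library(data.table)
testd <- as.data.table(test)
# one row per id, with V1 holding the row number of the group's first row;
# for the sorted sample data these are 1, 3, 5, 7, 9
testd[, .I[1], by = id]
```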

A simple ddply option (plyr must be loaded):

library(plyr)
ddply(test, .(id), function(x) head(x, 1))
Maël
joran
  • Surprisingly, sqldf does it faster: 1.77 0.13 1.92 vs 10.53 0.00 10.79 with data.table – dmvianna Nov 07 '12 at 23:21
  • @dmvianna I wouldn't necessarily count out data.table. I'm not an expert with that tool, so my data.table code may not be the most efficient way to go about that. – joran Nov 07 '12 at 23:26
  • I upvoted this prematurely. When I ran it on a large data.table, it was ridiculously slow and it didn't work: the number of rows was the same after. – James Hirschorn Jan 13 '19 at 03:07
  • @JamesHirachorn I wrote this a long time ago, the package has changed a lot, and I hardly use data.table at all. If you find the right way to do this with that package, feel free to suggest an edit to make it better. – joran Jan 13 '19 at 05:04
9

Now, for dplyr, adding a distinct counter:

df %>%
    group_by(aa, bb) %>%
    summarise(first=head(value,1), count=n_distinct(value))

You create groups, then summarise within groups.

You can also use:
first(value) [there is also last(value)] in place of head(value, 1)

see: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
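With first(), the summarise above would read like this (a sketch along the same lines, not a separate run):

```r
library(dplyr)
# first() returns the first element of each group's value vector,
# equivalent here to head(value, 1)
df %>%
    group_by(aa, bb) %>%
    summarise(first = first(value), count = n_distinct(value))
```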

Full:

> df
Source: local data frame [16 x 3]

   aa bb value
1   1  1   GUT
2   1  1   PER
3   1  2   SUT
4   1  2   GUT
5   1  3   SUT
6   1  3   GUT
7   1  3   PER
8   2  1   221
9   2  1   224
10  2  1   239
11  2  2   217
12  2  2   221
13  2  2   224
14  3  1   GUT
15  3  1   HUL
16  3  1   GUT

> library(dplyr)
> df %>%
+   group_by(aa, bb) %>%
+   summarise(first=head(value,1), count=n_distinct(value))

Source: local data frame [6 x 4]
Groups: aa

  aa bb first count
1  1  1   GUT     2
2  1  2   SUT     2
3  1  3   SUT     3
4  2  1   221     3
5  2  2   217     3
6  3  1   GUT     2
Paul
  • This answer is quite dated - there are better ways to do this with `dplyr` that don't require writing a statement for every single column to be included (see atomman's answer below, for example). Also I'm not sure what *"if data is numeric"* has to do with whether one would use `first(value)` vs `head(value)` (or just `value[1]`) – Gregor Thomas Jun 20 '18 at 18:56
7

(1) SQLite has a built-in rowid pseudo-column, so this works:

sqldf("select min(rowid) rowid, id, string 
               from test 
               group by id")

giving:

  rowid id string
1     1  1      A
2     3  2      B
3     5  3      C
4     7  4      D
5     9  5      E

(2) Also sqldf itself has a row.names= argument:

sqldf("select min(cast(row_names as real)) row_names, id, string 
              from test 
              group by id", row.names = TRUE)

giving:

  id string
1  1      A
3  2      B
5  3      C
7  4      D
9  5      E

(3) A third alternative which mixes the elements of the above two might be even better:

sqldf("select min(rowid) row_names, id, string 
               from test 
               group by id", row.names = TRUE)

giving:

  id string
1  1      A
3  2      B
5  3      C
7  4      D
9  5      E

Note that all three of these rely on a SQLite extension to SQL where the use of min or max is guaranteed to result in the other columns being chosen from the same row. (In other SQL-based databases that may not be guaranteed.)
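If portability matters, the same result can be expressed without the bare-column extension by using a subquery, in which min() is used in the ordinary standard-SQL way (this sketch still assumes SQLite's rowid pseudo-column):

```r
library(sqldf)

# keep only the rows whose rowid is the smallest within their id group
sqldf("select id, string from test
       where rowid in (select min(rowid) from test group by id)")
```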

G. Grothendieck
  • Thanks! This is much better than the accepted answer IMO because it's generalizable to taking the first/last element in an aggregate step using multiple aggregate functions (i.e. take the first of this variable, sum that variable, etc). – Bridgeburners Mar 10 '15 at 17:01
5

A base R option is the split()-lapply()-do.call() idiom:

> do.call(rbind, lapply(split(test, test$id), head, 1))
  id string
1  1      A
2  2      B
3  3      C
4  4      D
5  5      E

A more direct option is to lapply() the [ function:

> do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
  id string
1  1      A
2  2      B
3  3      C
4  4      D
5  5      E

The `1, )` (with the trailing comma and space) at the end of the lapply() call is essential, as it is the equivalent of calling `[1, ]` to select the first row and all columns.

Gavin Simpson
  • This was very slow, Gavin: user system elapsed 91.84 6.02 101.10 – dmvianna Nov 07 '12 at 23:22
  • Anything involving data frames will be. Their utility comes at a price. Hence data.table, for example. – Gavin Simpson Nov 07 '12 at 23:32
  • In my defence, and R's, you didn't mention anything about efficiency in the question. Often ease of use *is* a feature. Witness the popularity of plyr, which is "slow" too, at least until the next version that has data.table support. – Gavin Simpson Nov 08 '12 at 07:10
  • I agree. I didn't mean to insult you. I did find, though, that @Joshua-Ulrich's method was __both__ fast and easy. :7) – dmvianna Nov 08 '12 at 09:04
  • No need to apologise and I didn't take it as an insult. Was just pointing out that it was offered without any claim of efficiency. Remember this [so] Q&A is not just for your benefit but that of other users who come across your question as they have a similar problem. – Gavin Simpson Nov 08 '12 at 09:09
0

A very fast option is collapse::ffirst:

library(collapse)
ffirst(test, g = test$id)

#   id string
# 1  1      A
# 2  2      B
# 3  3      C
# 4  4      D
# 5  5      E

A more recent dplyr answer is to use inline grouping with slice_head's by argument:

library(dplyr)
slice_head(test, n = 1, by = id)

Comparing with data.table, on a data set with 1,000,000 rows and 10,000 groups, collapse is almost twice as fast:

Unit: milliseconds
     expr     min       lq     mean   median       uq      max neval
 collapse  8.8234 10.31675 13.27663 11.85590 14.59135  35.9251   100
       DT 17.0479 19.35955 24.61700 21.34465 24.61960 172.5803   100
      DT2 10.5810 13.03335 23.65378 21.70410 26.26575 195.0825   100

Code

set.seed(21)
library(collapse)
library(data.table)
library(dplyr)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
DT <- data.table(test, key = 'id')

library(microbenchmark)
microbenchmark(
  collapse = ffirst(test, g = test$id),
  DT = DT[J(unique(DT, by = "id")), mult = "first"],
  DT2 = DT[DT[, .I[1], by = id]$V1]
)
Maël