0

I'm learning data.table and I would like to reproduce computations I realized with dplyr.

I have a dataframe with daily observations for individuals identified by 2 IDs. And I would like to keep for each individual the last observation.

For example with those dummy datas :

set.seed(42)
datas <- data.frame(id1 = rep(1:2, 2), 
                    id2 = rep(letters[1:2], 2), 
                    date = seq(Sys.Date() - 3, Sys.Date(), by = "day"), 
                    x = rnorm(4))

  id1 id2       date          x
1   1   a 2015-08-02  1.3709584
2   2   b 2015-08-03 -0.5646982
3   1   a 2015-08-04  0.3631284
4   2   b 2015-08-05  0.6328626

This is the output I'm expecting :

library(dplyr)
datas %>% 
  group_by(id1, id2) %>% 
  filter(date == max(date))

Source: local data frame [2 x 4]
Groups: id1, id2

  id1 id2       date         x
1   1   a 2015-08-04 0.3631284
2   2   b 2015-08-05 0.6328626

And this is my attempt with data.table :

library(data.table)

datas.dt <- data.table(datas)
tmp <- datas.dt[, list(date = max(date)), by = c("id1", "id2")]

setkey(datas.dt, id1, id2, date)
setkey(tmp, id1, id2, date)

datas.dt[tmp]

id1 id2       date         x
1:   1   a 2015-08-04 0.3631284
2:   2   b 2015-08-05 0.6328626

I'm calling a join so I think I'm missing something in the basics, but I couldn't find a more direct way to achieve this.

Is there a more straight way with data.table to achieve what I would like to do ? Thanks.

Julien Navarre
  • 7,653
  • 3
  • 42
  • 69
  • 1
    Perhaps you're looking for the `mult` argument? http://stackoverflow.com/questions/10474456/using-data-table-to-flag-the-first-or-last-record-in-a-group – Roman Luštrik Aug 05 '15 at 08:52
  • 1
    If you want to do something by groups, you will need the `by=...` inside the `[ ]`. See the [introduction to `data.table`](https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro-vignette.html) – Jaap Aug 05 '15 at 08:58
  • 1
    Maybe just `datas.dt[, .SD[which.max(date)], by = c("id1", "id2")]`? Or `datas.dt[, .SD[date == max(date)], by = c("id1", "id2")]` – David Arenburg Aug 05 '15 at 09:01
  • Thanks David. I didn't know how to keep all the columns and `.SD[which.max(date)]` is exactly what I was looking for. Add it as answer, I would be glad to accept it. – Julien Navarre Aug 05 '15 at 09:04

0 Answers0