14

I have a dataset where each individual (id) has an e_date, and since each individual could have more than one e_date, I'm trying to get the earliest date for each individual. So basically I would like to have a dataset with one row per each id showing his earliest e_date value. I've use the aggregate function to find the minimum values, I've created a new variable combining the date and the id and last I've subset the original dataset based on the one containing the minimums using the new variable created. I've come to this:

new <- aggregate(e_date ~ id, data_full, min)

data_full["comb"] <- NULL
data_full$comb <- paste(data_full$id,data_full$e_date)

new["comb"] <- NULL
new$comb <- paste(new$lopnr,new$EDATUM)

data_fixed <- data_full[which(new$comb %in% data_full$comb),]

The first thing is that the aggregate function doesn't seems to work at all, it reduces the number of rows but viewing the data I can clearly see that some ids appear more than once with different e_date. Plus, the code gives me different results when I use the as.Date format instead of its original format for the date (integer). I think the answer is simple but I'm struck on this one.

pietro
  • 143
  • 1
  • 1
  • 5
  • It would be good to provide reproducible data especially when working with dates. I created my own example and `aggregate` with `min` worked fine, so you really need to `dput` the data. http://stackoverflow.com/help/mcve See also the guidelines for sharing data in the R tag description. – Hack-R Aug 11 '16 at 10:27
  • What is the format of `e_date` column ? Also have a look at [this](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Ronak Shah Aug 11 '16 at 10:29

4 Answers4

27

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(data_full)), grouped by 'id', we get the 1st row (head(.SD, 1L)).

library(data.table)
setDT(data_full)[order(e_date), head(.SD, 1L), by = id]

Or using dplyr, after grouping by 'id', arrange the 'e_date' (assuming it is of Date class) and get the first row with slice.

library(dplyr)
data_full %>%
    group_by(id) %>%
    arrange(e_date) %>%
    slice(1L)

If we need a base R option, ave can be used

data_full[with(data_full, ave(e_date, id, FUN = function(x) rank(x)==1)),]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Could you provide a reproducible example of the base R option please? tried on a couple of dataframes but get the error: "Error in as.Date.default(value) : do not know how to convert 'value' to class “Date”" The date var I'm using is definitely a date var. Thanks – L Tyrone Aug 25 '17 at 02:37
  • @LeroyTyrone Wouldn't it be better to ask the OP for that – akrun Aug 25 '17 at 05:15
11

Another answer that uses dplyr's filter command:

dta %>% 
  group_by(id) %>%
  filter(date == min(date))
greg_s
  • 134
  • 1
  • 4
2

You may use library(sqldf) to get the minimum date as follows:

data1<-data.frame(id=c("789","123","456","123","123","456","789"),
                  e_date=c("2016-05-01","2016-07-02","2016-08-25","2015-12-11","2014-03-01","2015-07-08","2015-12-11"))  

library(sqldf)
data2 = sqldf("SELECT id,
                    min(e_date) as 'earliest_date'
                    FROM data1 GROUP BY 1", method = "name__class")    

head(data2)   

id   earliest_date   
123    2014-03-01      
456    2015-07-08   
789    2015-12-11  
AndrewGB
  • 16,126
  • 5
  • 18
  • 49
1

I made a reproducible example, supposing that you grouped some dates by which quarter they were in.

library(lubridate)
library(dplyr)
rand_weeks <- now() + weeks(sample(100))
which_quarter <- quarter(rand_weeks)
df <- data.frame(rand_weeks, which_quarter)

df %>%
  group_by(which_quarter) %>% summarise(sort(rand_weeks)[1])

# A tibble: 4 x 2
  which_quarter sort(rand_weeks)[1]
          <dbl>              <time>
1             1 2017-01-05 05:46:32
2             2 2017-04-06 05:46:32
3             3 2016-08-18 05:46:32
4             4 2016-10-06 05:46:32
shayaa
  • 2,787
  • 13
  • 19