R: dataframe Selecting the maximum row per ID based on first timestamp

Question

I have a data frame which contains records that have time stamps. The toy example below contains an ID with 2 SMS's attached to it based on two different time stamps. In reality there would be thousands of IDs each with almost 80-100 SMS Types and dates

toydf <- data.frame(ID = c(1045937900, 1045937900), 
                    SMS.Type = c("DF1", "WCB14"), 
                    SMS.Date = c("12/02/2015 19:51", "13/02/2015 08:38"))

I want to be able to create a new dataframe that only contains the the record of the SMS type for the first SMS.Date or even the last

I have had a look at using duplicated, I have also thought about sorting the date column in descending order per ID and adding a new column which puts a 1 next to the first instance of the ID and a zero if the current ID is equal to the previous ID. I suspect this will get heavy if the number of records increases dramatically

Does anyone know a more elegant way of doing this - maybe using data.table

Thanks for your time

akrun · Accepted Answer · 2015-05-25T13:04:03.810

1

Try

library(dplyr)
toydf %>% 
   group_by(ID) %>%
   arrange(desc(as.POSIXct(SMS.Date, format='%d/%m/%Y %H:%M'))) %>% 
   slice(1L)

Or using data.table

library(data.table)
toydf$SMS.Date <- as.POSIXct(toydf$SMS.Date, format='%d/%m/%Y %H:%M')
setkey(setDT(toydf), ID, SMS.Date)[, .SD[.N], ID]

edited May 25 '15 at 13:04

answered May 25 '15 at 12:58

akrun

874,273
37
540
662

@JohnSmith Glad to know that it helped. Using `.I` would be a bit more faster with `data.table`. It is always good to post at least 6-10 lines for testing purposes – akrun May 25 '15 at 13:17

R: dataframe Selecting the maximum row per ID based on first timestamp

1 Answers1