1

I'd need a function that checks my data frame for an otherwise identical row, but with date minus 1, and returns true if it exists. It's a large data frame, so I'd like to do it as efficiently as possible.

For example, take the following data frame:

name  |date       
Timmy |01/Jan/2016
Timmy |02/Jan/2016
Timmy |03/Jan/2016
Sally |04/Jan/2016
Johnny|13/Feb/2016
Johnny|29/Mar/2016

The function should see Timmy|02/Jan/2016, check if Timmy|01/Jan/2016 exists, and return true. The resulting data frame would look like this:

name  |date       |hasDateMinusOne
Timmy |01/Jan/2016|false
Timmy |02/Jan/2016|true
Timmy |03/Jan/2016|true
Sally |04/Jan/2016|false
Johnny|13/Feb/2016|false
Johnny|29/Mar/2016|false

This is the closest answer I've found. Although it was answered by Hadley, it's 5 years old and predates dplyr. I'm wondering if it's still the most efficient way to handle 1,000,000+ rows.

Thanks!

Sean

Community
  • 1
  • 1
Sean G
  • 385
  • 1
  • 3
  • 12

2 Answers2

2

If you format date as a date, you can just subtract one:

library(dplyr)

df %>% group_by(name) %>% 
    mutate(date = as.Date(date, '%d/%b/%Y'), 
           hasDateMinusOne = (date - 1) %in% date)

# Source: local data frame [6 x 3]
# Groups: name [3]
# 
#     name       date hasDateMinusOne
#   (fctr)     (date)           (lgl)
# 1  Timmy 2016-01-01           FALSE
# 2  Timmy 2016-01-02            TRUE
# 3  Timmy 2016-01-03            TRUE
# 4  Sally 2016-01-04           FALSE
# 5 Johnny 2016-02-13           FALSE
# 6 Johnny 2016-03-29           FALSE
alistaire
  • 42,459
  • 4
  • 77
  • 117
1

We can do this using only base R. Convert the 'date' to 'Date' class using transform, then with ave we group by 'name' and find whether the day before is found %in% the 'date' column.

df <- transform(df, date = as.Date(date, "%d/%b/%Y"))
df$hasDateMinusOne <- with(df, !!ave(as.integer(date), name,
            FUN = function(x) (x-1) %in% x))

If efficiency matters, another option is data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), change 'date' to 'Date' class, grouped by 'name' we find whether the previous day is found %in% the 'date' column.

setDT(df)[, date := as.Date(date, '%d/%b/%Y') 
     ][, hasDateMinusOne := (date-1) %in% date, by =  name]
df
#     name       date hasDateMinusOne
#1:  Timmy 2016-01-01           FALSE
#2:  Timmy 2016-01-02            TRUE
#3:  Timmy 2016-01-03            TRUE
#4:  Sally 2016-01-04           FALSE
#5: Johnny 2016-02-13           FALSE
#6: Johnny 2016-03-29           FALSE
akrun
  • 874,273
  • 37
  • 540
  • 662