3

I have my dataset such that

df <- data.frame(ID = c("m1","m2","m3","m4","m5","m6","m2","m3","m5","m6","m1","m4","m5"),
                 Year = c(1,1,1,1,1,1,2,2,2,2,3,3,3))

and want to perform a check whether the ID appears in the previous year. Now I have a code that seems to work

df$Check <- apply(df, 1, function(x) x["ID"] %in% df[df$Year == (as.numeric(x["Year"]) - 1), "ID"])

but given that my dataset is 3million rows long this function takes far too long to run. Is there a better alternative to this??

Henrik
  • 65,555
  • 14
  • 143
  • 159
Gowzie
  • 33
  • 3

3 Answers3

4

Try

library(dplyr)
dfs <- split(df$ID, df$Year);
df$check <- unlist(mapply(`%in%`, dfs,  lag(dfs)))
ExperimenteR
  • 4,453
  • 1
  • 15
  • 19
1
k = length(unique(df$Year))        # how many years in the data
q = unique(df$Year)                # which are the years present

func <- function(x){  
  kk = df$ID[df$Year == q[x]]      # get the current year's ID which are present
  kk %in% df$ID[df$Year == q[x-1]] # compare that to the previous year's ID
}

x <- sum(df$Year==unique(df$Year)[1]) #to know how many FALSE to be added initially
df$check <- c(rep(FALSE, x),unlist(lapply(2:k, func)))
joel.wilson
  • 8,243
  • 5
  • 28
  • 48
  • did it help you @Gowzie ? – joel.wilson Dec 05 '16 at 06:54
  • Unfortunately getting the error of the "replacement has 20 rows, data has 13" and haven't had time to look in to how to resolve this. But since @ExperimenteR solution worked so well I'm currently using his. – Gowzie Dec 05 '16 at 09:59
1

You may use ave: for each ID, calculate the difference between current Year and preceeding Year (diff). Pad with a leading zero. Check if the result is 1 to create a logical vector:

df$check2 <- with(df, ave(Year, ID, FUN = function(x) c(0, diff(x))) == 1)
#    ID Year check check2
# 1  m1    1 FALSE  FALSE
# 2  m2    1 FALSE  FALSE
# 3  m3    1 FALSE  FALSE
# 4  m4    1 FALSE  FALSE
# 5  m5    1 FALSE  FALSE
# 6  m6    1 FALSE  FALSE
# 7  m2    2  TRUE   TRUE
# 8  m3    2  TRUE   TRUE
# 9  m5    2  TRUE   TRUE
# 10 m6    2  TRUE   TRUE
# 11 m1    3 FALSE  FALSE
# 12 m4    3 FALSE  FALSE
# 13 m5    3  TRUE   TRUE

Similar with data.table:

For each ID (by = ID), create the new variable check2: check if the difference between current Year and preceeding Year in the data is 1 ((diff(year) == 1), i.e. if the preceeding year is the previous year.

library(data.table)
setDT(df)[ , Check2 := c(FALSE, diff(Year) == 1), by = ID]

Edit following comment by OP. In case of "multiple entries of the same ID in the same year", you perform the calulation on data where duplicated rows are removed (unique). Then join the result to the original data.

df2 <- unique(df)
df2[ , Check2 := c(FALSE, diff(Year) == 1), by = ID]
df[df2, on = c("ID", "Year")] 
Henrik
  • 65,555
  • 14
  • 143
  • 159
  • Good alternative, but fails when you have multiple entries of the same ID in the same year. Definitely easy to work around and a hell of a lot quicker than my original solution! – Gowzie Dec 05 '16 at 10:03
  • @Gowzie Thanks for your feedback. Please remember to always construct an examples in your question which has sufficient complexity. We are not mind readers...;) – Henrik Dec 05 '16 at 10:07