Check if column 1 value appeared previously in dataset with a different column 2 value

Question

I have my dataset such that

df <- data.frame(ID = c("m1","m2","m3","m4","m5","m6","m2","m3","m5","m6","m1","m4","m5"),
                 Year = c(1,1,1,1,1,1,2,2,2,2,3,3,3))

and want to perform a check whether the ID appears in the previous year. Now I have a code that seems to work

df$Check <- apply(df, 1, function(x) x["ID"] %in% df[df$Year == (as.numeric(x["Year"]) - 1), "ID"])

but given that my dataset is 3million rows long this function takes far too long to run. Is there a better alternative to this??

score 4 · Accepted Answer · answered Dec 03 '16 at 15:57

4

Try

library(dplyr)
dfs <- split(df$ID, df$Year);
df$check <- unlist(mapply(`%in%`, dfs,  lag(dfs)))

answered Dec 03 '16 at 15:57

ExperimenteR

4,453
1
15
19

Cheers, works great! Just requires the dataset to be ordered by year beforehand – Gowzie Dec 03 '16 at 16:07
@Gowzie Yes, you need `df <- arrange(df, Year)` beforehand. Let me know how it worked out for your real data. – ExperimenteR Dec 03 '16 at 16:09

joel.wilson · Answer 2 · 2016-12-07T15:53:49.847

1

k = length(unique(df$Year))        # how many years in the data
q = unique(df$Year)                # which are the years present

func <- function(x){  
  kk = df$ID[df$Year == q[x]]      # get the current year's ID which are present
  kk %in% df$ID[df$Year == q[x-1]] # compare that to the previous year's ID
}

x <- sum(df$Year==unique(df$Year)[1]) #to know how many FALSE to be added initially
df$check <- c(rep(FALSE, x),unlist(lapply(2:k, func)))

edited Dec 07 '16 at 15:53

answered Dec 03 '16 at 16:08

joel.wilson

8,243
5
28
48

did it help you @Gowzie ? – joel.wilson Dec 05 '16 at 06:54
Unfortunately getting the error of the "replacement has 20 rows, data has 13" and haven't had time to look in to how to resolve this. But since @ExperimenteR solution worked so well I'm currently using his. – Gowzie Dec 05 '16 at 09:59

Henrik · Answer 3 · 2016-12-05T11:01:58.297

You may use ave: for each ID, calculate the difference between current Year and preceeding Year (diff). Pad with a leading zero. Check if the result is 1 to create a logical vector:

df$check2 <- with(df, ave(Year, ID, FUN = function(x) c(0, diff(x))) == 1)
#    ID Year check check2
# 1  m1    1 FALSE  FALSE
# 2  m2    1 FALSE  FALSE
# 3  m3    1 FALSE  FALSE
# 4  m4    1 FALSE  FALSE
# 5  m5    1 FALSE  FALSE
# 6  m6    1 FALSE  FALSE
# 7  m2    2  TRUE   TRUE
# 8  m3    2  TRUE   TRUE
# 9  m5    2  TRUE   TRUE
# 10 m6    2  TRUE   TRUE
# 11 m1    3 FALSE  FALSE
# 12 m4    3 FALSE  FALSE
# 13 m5    3  TRUE   TRUE

Similar with data.table:

For each ID (by = ID), create the new variable check2: check if the difference between current Year and preceeding Year in the data is 1 ((diff(year) == 1), i.e. if the preceeding year is the previous year.

library(data.table)
setDT(df)[ , Check2 := c(FALSE, diff(Year) == 1), by = ID]

Edit following comment by OP. In case of "multiple entries of the same ID in the same year", you perform the calulation on data where duplicated rows are removed (unique). Then join the result to the original data.

df2 <- unique(df)
df2[ , Check2 := c(FALSE, diff(Year) == 1), by = ID]
df[df2, on = c("ID", "Year")]

Good alternative, but fails when you have multiple entries of the same ID in the same year. Definitely easy to work around and a hell of a lot quicker than my original solution! — Gowzie, Dec 05 '16 at 10:03
@Gowzie Thanks for your feedback. Please remember to always construct an examples in your question which has sufficient complexity. We are not mind readers...;) — Henrik, Dec 05 '16 at 10:07

Check if column 1 value appeared previously in dataset with a different column 2 value

3 Answers3

Linked