Keep observations for ID that have multiple years of data

Question

I am working with a dataset in R and would like to keep the ID numbers where there is more than 1 year of data available.

With the picture as a reference, I would like to keep the rows where the ID number is 1 or 2 (since they have more than 1 year of observed data), but remove those with ID number 3 (since data was only observed in one year).

How can I do this easily in R? My thought was to loop across the row numbers and create a dummy variable where the condition I need is met. I was thinking of having a 1 where the difference in the ID number is 0 and the difference in the year is not 0. This would allow me to identify the ID numbers I need to keep.

for(i in 1:nrow(agg_data_condensed)){
  agg_data_condensed = mutate(dum = case_when((agg_data_condensed[i,1]-agg_data_condensed[i-1,1] == 0) & (agg_data_condensed[i,7]-agg_data_condensed[i-1,7] != 0) ~ 1 ))
}

However, this is not giving me what I want. It is actually giving me the error "Error in UseMethod("mutate") : no applicable method for 'mutate' applied to an object of class "c('double', 'numeric')".

Any help would be greatly appreciated!

Edit: here is the output from the dput function

structure(list(ID = c(1, 1, 1, 2, 2, 2, 3), Year = c(2005, 2006, 
2007, 2005, 2006, 2006, 2008)), row.names = c(NA, -7L), class = c("tbl_df", 
"tbl", "data.frame"))

Please don't post data as images. Take a look at how to make a [great reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for ways of showing data. The gold standard for providing data is using `dput(head(NameOfYourData))`, *editing* your question and putting the `structure()` output into the question. — Martin Gal, Sep 06 '21 at 17:06

Martin Gal · Accepted Answer · 2021-09-06T17:18:49.830

You could use

library(dplyr)

df %>% 
  group_by(ID) %>% 
  filter(n_distinct(Year) > 1) %>%
  ungroup()

This returns

# A tibble: 6 x 2
     ID  Year
  <dbl> <dbl>
1     1  2005
2     1  2006
3     1  2007
4     2  2005
5     2  2006
6     2  2006

The n_distinct() function doesn't count ID 2's year 2006 twice. If you want duplicate years to count, that is, to keep every ID with more than one row, replace n_distinct(Year) with n().
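With the question's data, both filters return the same six rows, so the difference is easy to miss. Here is a minimal sketch of where they diverge. It assumes a hypothetical data frame `df2` in which ID 4 (not in the original data) has two rows from the same year:

```r
library(dplyr)

# Hypothetical data for illustration only: ID 4 is an assumption,
# added so that one ID has two rows but only one distinct year.
df2 <- data.frame(ID   = c(1, 1, 3, 4, 4),
                  Year = c(2005, 2006, 2008, 2009, 2009))

# Keep IDs observed in more than one distinct year
by_distinct <- df2 %>%
  group_by(ID) %>%
  filter(n_distinct(Year) > 1) %>%
  ungroup()

# Keep IDs that have more than one row, regardless of year
by_rows <- df2 %>%
  group_by(ID) %>%
  filter(n() > 1) %>%
  ungroup()

unique(by_distinct$ID)  # 1
unique(by_rows$ID)      # 1 4
```

ID 4 survives only the n() filter, because its two rows share a single year.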

Data

df <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 3),
                 Year = c(2005, 2006, 2007, 2005, 2006, 2006, 2008))

score 1 · Answer 2 · answered Sep 06 '21 at 17:15


Base R rendering of Martin Gal's dplyr answer:

df[ave(df$Year, df$ID, FUN = function(z) length(unique(z)) > 1) > 0,]
#   ID Year
# 1  1 2005
# 2  1 2006
# 3  1 2007
# 4  2 2005
# 5  2 2006
# 6  2 2006
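If the ave() one-liner is hard to read, the same idea can be spelled out in separate steps. This is only a sketch of an equivalent approach using tapply(), not part of the original answer. The names `years_per_id`, `keep_ids`, and `result` are made up for illustration:

```r
# Same data as in the accepted answer
df <- data.frame(ID   = c(1, 1, 1, 2, 2, 2, 3),
                 Year = c(2005, 2006, 2007, 2005, 2006, 2006, 2008))

# Step 1: count the distinct years for each ID (a named vector, names are the IDs)
years_per_id <- tapply(df$Year, df$ID, function(z) length(unique(z)))

# Step 2: collect the IDs that were observed in more than one distinct year
keep_ids <- names(years_per_id)[years_per_id > 1]

# Step 3: keep only the rows belonging to those IDs
# (names() returns character, so compare against as.character(df$ID))
result <- df[as.character(df$ID) %in% keep_ids, ]
result
```

This keeps the rows for IDs 1 and 2 and drops ID 3, matching the ave() version.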

answered Sep 06 '21 at 17:15

r2evans


score 1 · Answer 3 · answered Sep 06 '21 at 17:22


data.table solution:

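library(data.table)
setDT(df)                                # convert df to a data.table by reference
df <- df[, frq := .N, by = ID][frq > 1]  # count rows per ID, keep IDs with more than one row
df[, frq := NULL]                        # drop the helper column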

Output:

     ID  Year
  <dbl> <dbl>
1     1  2005
2     1  2006
3     1  2007
4     2  2005
5     2  2006
6     2  2006
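Note that .N counts rows per ID, so this behaves like n() rather than n_distinct(Year): an ID with two rows from the same year would be kept. If distinct years are what matters, one possible variant uses data.table's uniqueN(). This is a sketch under that assumption, with `dt` and `result` as illustrative names:

```r
library(data.table)

# Same data as in the accepted answer
dt <- data.table(ID   = c(1, 1, 1, 2, 2, 2, 3),
                 Year = c(2005, 2006, 2007, 2005, 2006, 2006, 2008))

# For each ID, return its rows (.SD) only if it has more than one distinct year;
# groups that fail the condition return NULL and are dropped.
result <- dt[, if (uniqueN(Year) > 1) .SD, by = ID]
result
```

This avoids the temporary frq column and leaves `dt` itself unmodified.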

answered Sep 06 '21 at 17:22

Samet Sökel

