R subset of data frame for cohort analysis

Question

I have the following data in the data frame df:

persons  year
personA  2015
personB  2016
personC  2015
personB  2015

How do I use the subset function in R to get personB, who appears in both 2015 and 2016? I am using the following code, but it does not work:

df1 <- subset(df, (year==2015 & year ==2016))

I just corrected the data frame. I want to get persons who are in both 2015 and 2016. personB is in both. – rstudy12, Mar 29 '17 at 21:30
You're close. This topic is covered extensively on SO and you may find a relevant example (e.g. [here](http://stackoverflow.com/questions/4935479/how-to-combine-multiple-conditions-to-subset-a-data-frame-using-or)) — metasequoia, Mar 29 '17 at 21:42
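
For readers landing here: the condition in the question is evaluated row by row, and no single row can have year equal to both 2015 and 2016, so the result is always empty. A minimal base R sketch of one way around this, assuming the sample data from the question (the names in_2015, in_2016 and both are only illustrative):

```r
# Sample data from the question
df <- data.frame(persons = c("personA", "personB", "personC", "personB"),
                 year = c(2015, 2016, 2015, 2015),
                 stringsAsFactors = FALSE)

# The original condition is tested per row, so it can never be TRUE
empty_result <- subset(df, year == 2015 & year == 2016)

# Collect the persons seen in each year, then keep those seen in both
in_2015 <- df$persons[df$year == 2015]
in_2016 <- df$persons[df$year == 2016]
both <- intersect(in_2015, in_2016)

# subset() now filters on membership in that set of persons
df1 <- subset(df, persons %in% both)
```

Here df1 keeps every row belonging to a person who appears in both years, rather than testing both years on the same row.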

score 1 · Accepted Answer · answered Mar 29 '17 at 21:43

I'd use dplyr for this as it's much easier than in base R.

library(dplyr)
df %>% group_by(persons) %>% filter(n() == 2)

This groups the rows by person and then retains only groups with two members (both years).

answered Mar 29 '17 at 21:43

Joe


Just because `n()==2` doesn't mean they are both in 2015 and 2016. There could be two records for one year or records for two different years, or they might be in 2015, 2016, and 2014 and would be missed. – MrFlick Mar 29 '17 at 21:45
Absolutely agree. I will think of editing to cover these cases. – Joe Mar 29 '17 at 21:45
I think `df %>% group_by(persons) %>% filter(all(2015:2016 %in% year))` should do the trick to require a value for those specific years (but still allow possible duplicates). – MrFlick Mar 29 '17 at 21:51
Nice, you beat me to it. – Joe Mar 29 '17 at 21:54
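
To make the difference discussed in these comments concrete, here is a small sketch comparing the two filters. The data frame is made up for illustration (it is not the asker's data): personA has two records in the same year, which is exactly the case where counting rows goes wrong.

```r
library(dplyr)

# Hypothetical data: personA appears twice in 2015, personB once in each year
df <- data.frame(persons = c("personA", "personA", "personB", "personB", "personC"),
                 year = c(2015, 2015, 2015, 2016, 2015),
                 stringsAsFactors = FALSE)

# Counting rows per person: personA passes even though 2016 is missing
by_count <- df %>%
  group_by(persons) %>%
  filter(n() == 2) %>%
  ungroup()

# Requiring both specific years: only personB passes
by_years <- df %>%
  group_by(persons) %>%
  filter(all(2015:2016 %in% year)) %>%
  ungroup()

unique(by_count$persons)
unique(by_years$persons)
```

The second filter keeps all rows of a qualifying person, including rows from other years; add a further filter on year if only the 2015 and 2016 rows are wanted.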

Fadwa · Answer 2 · 2017-03-30T13:36:04.867

0

df2 <- df[(df$year== 2015 | df$year== 2016),][1]

## count how many times each person appears in the data frame
t <- table(df2)
# 
# personA personB personC 
# 1       2       1 

t[t>1]
# personB 
# 2

The dataframe

df <- data.frame("persons" = c("personA","personB","personC","personB"),
 "year" = c(2015,2016,2015,2015))

EDIT

Another solution using duplicated

 duplicated(df$persons)
#[1] FALSE FALSE FALSE  TRUE
 df[duplicated(df$persons),1]
# personB
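
One caveat, shown as a hedged sketch: both table and duplicated only count repeated names, so a person with two rows in the same year would also be returned. Assuming that case can occur, dropping duplicate rows first makes the count reflect distinct years. The data below is made up for illustration, and wanted, distinct_rows and counts are illustrative names:

```r
# Hypothetical data: personA has two rows, both in 2015
df <- data.frame(persons = c("personA", "personA", "personB", "personB"),
                 year = c(2015, 2015, 2015, 2016),
                 stringsAsFactors = FALSE)
wanted <- c(2015, 2016)

# duplicated() flags personA as well, because it only looks at names
flagged <- unique(df$persons[duplicated(df$persons)])

# Drop repeated (person, year) rows first, then count rows per person;
# a person present in every wanted year has exactly length(wanted) rows left
distinct_rows <- unique(df[df$year %in% wanted, ])
counts <- table(distinct_rows$persons)
both <- names(counts)[counts == length(wanted)]
```

With this data, flagged contains personA and personB, while both contains only personB.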

edited Mar 30 '17 at 13:36

answered Mar 29 '17 at 21:28

Fadwa


I want persons who are in both 2015 and 2016. I don't want to filter based on persons. – rstudy12 Mar 29 '17 at 21:32

Kristoffer Winther Balling · Answer 3 · 2017-03-29T21:56:05.013

An example using data.table (and unique to handle multiple rows of same person in same year):

library(data.table)
dt <- structure(list(persons = c("personA", "personB", "personC", "personB"
), year = c(2015L, 2016L, 2015L, 2015L)), .Names = c("persons", 
"year"), row.names = c(NA, -4L), class = "data.frame")
setDT(dt)
years <- c(2015L, 2016L)
# Filter by years and make sure all rows are unique combinations of persons and
# those years. Then set in_all_years to TRUE if the number of rows is equal to
# the number of years
out <- unique(dt[year %in% years])[, in_all_years := .N == length(years),
  by = persons]

> out
   persons year in_all_years
1: personA 2015        FALSE
2: personB 2016         TRUE
3: personC 2015        FALSE
4: personB 2015         TRUE
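
If the goal is the subset itself rather than a flag column, a possible follow-up sketch is below. It assumes the years vector contains no repeated values and uses data.table's uniqueN to count distinct years per person; keep and cohort are illustrative names.

```r
library(data.table)

dt <- data.table(persons = c("personA", "personB", "personC", "personB"),
                 year = c(2015L, 2016L, 2015L, 2015L))
years <- c(2015L, 2016L)  # assumed to hold no duplicates

# Per person, count the distinct wanted years and keep those covering all of them
keep <- dt[year %in% years,
           .(in_all_years = uniqueN(year) == length(years)),
           by = persons][in_all_years == TRUE, persons]

# Rows of the original table for persons present in every wanted year
cohort <- dt[persons %in% keep]
```

With the sample data, keep is "personB" and cohort holds the two personB rows.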
