0

I use a dataset with 16 variables and 80.000 observations.

The variable "syear" describes the year of the observation (2008,2012,2016). The variable "pid" describes the unique person ID.

As you can see in the screenshot, it is possible that persons only participated in one or two years. I only want to keep observations from persons, who participated in all three years. In the screenshots this would be the pid 901 and 1501.

How do I filter my dataset by this condition?

pid and year

Scasto
  • 34
  • 3
  • 1
    Welcome to SO and thanks for your question. Please add a [reprex](https://stackoverflow.com/help/minimal-reproducible-example) next time. In R you can create this easily using the [reprex package](https://github.com/tidyverse/reprex) – BerriJ Jan 05 '21 at 14:03

2 Answers2

0

You can try this:

library(tidyverse)

df <- tribble(
  ~syear, ~pid,
  2008,201,
  2008,203,
  2008,602,
  2012,602,
  2008,604,
  2008,901,
  2012,901,
  2016,901,
  2008,1501,
  2012,1501,
  2016,1501
)

df %>% 
  group_by(pid) %>%
  mutate(cnt = n()) %>%
  filter(cnt == 3)

# alternatively, the cnt column can be dropped
df %>% 
  group_by(pid) %>%
  mutate(cnt = n()) %>%
  filter(cnt == 3) %>%
  select(-cnt)
nyk
  • 670
  • 5
  • 11
  • Worked! So you group the df by the pid and then you count how many observations are in the groups. Afterwards you filter to get only the groups with 3 observations? I don't really get how the mutate function works in this case. Especially what does n() mean? – Scasto Jan 05 '21 at 12:37
  • @Scasto this is based on the `dplyr` package. `mutate` is a transformation function. `n()` is a function for count. So, `mutate(cnt = n())` means create a new column based on the number of occurrence in a particular group (i.e. pid). You can read more about `dplyr` [here](https://r4ds.had.co.nz/transform.html) – nyk Jan 05 '21 at 12:51
0

As a simplification of nyk's answer you could also do this:

library(dplyr)
library(conflicted)

conflict_prefer("filter", "dplyr")
#> [conflicted] Will prefer dplyr::filter over any other package

tibble(
    year = c(2001, 2002, 2003, 2001, 2003, 2002, 2003),
    pid = c(1, 1, 1, 2, 2, 3, 3)
) %>%
    group_by(pid) %>%
    filter(n() == 3)
#> # A tibble: 3 x 2
#> # Groups:   pid [1]
#>    year   pid
#>   <dbl> <dbl>
#> 1  2001     1
#> 2  2002     1
#> 3  2003     1

Created on 2021-01-05 by the reprex package (v0.3.0)

So you do not have to create cnt as an intermediary variable. Depending on what you want to do afterward you may call ungroup().

BerriJ
  • 913
  • 7
  • 18