I am working with the HealthIns dataset from the 'pglm' package in R. I would like to drop all individuals whose number of observations is different from 5 (some of them are observed for only three years). In other words, I want to create a new data frame containing only the individuals for whom I have data for years 1, 2, 3, 4, and 5. Any suggestions on how I can do this? Thank you in advance.

Lyuba
  • Please include the code that you have tried. This post might help https://stackoverflow.com/questions/20204257/subset-data-frame-based-on-number-of-rows-per-group – Ronak Shah Jul 09 '21 at 11:18

1 Answer

First, let's find out which ids have data for all five years:

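# Load the tidyverse and the HealthIns data from pglm
library(tidyverse)
data("HealthIns", package = "pglm")

# Keep the ids that appear in exactly five rows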

complete <- HealthIns %>% 
  group_by(id) %>% 
  count() %>% 
  ungroup() %>% 
  filter(n == 5) %>% 
  pull(id)

Now we can use it to filter the data:

df <- HealthIns %>% 
  filter(id %in% complete)
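
The two steps above can also be combined into a single grouped filter. The following is a minimal sketch on a small made-up panel (the `toy` data frame and its values are assumptions for illustration, not taken from HealthIns); with the real data you would replace `toy` with `HealthIns`.

```r
library(dplyr)

# Toy panel standing in for HealthIns (hypothetical values):
# ids 1 and 3 are observed in all five years, id 2 only in three.
toy <- tibble(
  id   = c(rep(1, 5), rep(2, 3), rep(3, 5)),
  year = c(1:5, 1:3, 1:5)
)

# n() is the number of rows in the current group, so this keeps
# only the ids that have exactly five rows.
balanced <- toy %>%
  group_by(id) %>%
  filter(n() == 5) %>%
  ungroup()
```

This avoids the intermediate vector of ids, at the cost of being slightly less explicit about which ids were kept.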

Let's check if df is correct:

df %>% 
  group_by(year) %>% 
  count()

# A tibble: 5 x 2
# Groups:   year [5]
   year     n
  <dbl> <int>
1     1  1584
2     2  1584
3     3  1584
4     4  1584
5     5  1584

As you can see, df has the same number of observations (1584) for each value of year.
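
Equal counts per year are a good sign, but counting rows per id would not catch an id that has five rows with a repeated year. If that is a concern, a stricter per-id check might look like the sketch below. It runs on a made-up data frame (`toy` and its values are assumptions for illustration); with the real data you would use `df` instead.

```r
library(dplyr)

# Hypothetical data: id 1 covers years 1..5, id 4 has five rows
# but year 1 appears twice and year 5 is missing.
toy <- tibble(
  id   = c(rep(1, 5), rep(4, 5)),
  year = c(1:5, c(1, 1, 2, 3, 4))
)

# For each id, require exactly five rows whose years are exactly 1..5.
check <- toy %>%
  group_by(id) %>%
  summarise(ok = n() == 5 && setequal(year, 1:5))

# TRUE only if every id passes the check
all_ok <- all(check$ok)
```

Here `all_ok` is FALSE because id 4 fails, even though it has five rows.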

Radbys