How to select (four) specific rows (multiple times) based on a column value in R?

Question

I only want to select the ID's that are in my dataframe for all years, from 2013 untill 2016 (so four times). In that case ID's with only four rows are left (panel data, each ID has 1 row for each year). I already made sure my dataframe only covers the years I need (2013, 2014, 2015, and 2016), but I want to exclude the ID's that have less than 4 years/rows in my dataframe.

This is the structure of my dataframe:

 tibble [909,587 x 26] (S3: tbl_df/tbl/data.frame)
     $ ID                         : num [1:909587] 12 12 12 12 16 16 16 16...
     $ Gender                     : num [1:909587] 2 2 2 2 1 1 1 1 1 1 ...
      ..- attr(*, "format.spss")= chr "F10.0"
     $ Year                       : chr [1:909587] "2016" "2013" "2014" "2015" ...
      ..- attr(*, "format.spss")= chr "F9.3"
     $ Size                       : num [1:909587] 1983 1999 1951 1976 902 ...
     $ Costs                      : num [1:909587] 2957.47 0 0.34 1041.67 0 ...
     $ Urbanisation               : num [1:909587] 2 3 3 2 3 3 2 2 2 3 ...
     $ Age                        : num [1:909587] 92 89 90 91 82 83 22 23 24 65 ...

How can I achieve that?

Thank you!

Please read about how to provide a [good example](https://stackoverflow.com/q/5963269/2572423). Providing just the structure of your data is not that helpful. Consider using `dput`. That said, this _might_ work: `df %>% group_by(ID) %>% filter(n_distinct(Year) >= 4)` — JasonAizkalns, May 15 '20 at 12:25
Thank you Jason, I hoped the structure was sufficient. Good news tho; you code worked! Now I have only ID's with 4 years. Thank you! — Student0172, May 15 '20 at 12:30
Hi Jason, thank you for doing that! I just found out the code 'missed' a couple of years/rows/ID's. I still have 174 more rows than I should have based on lenght(unique(df$ID) * 4 rows.. Any idea how? — Student0172, May 15 '20 at 15:18

score 2 · Answer 1 · answered May 15 '20 at 12:27

2

Pivot your df

df %>% pivot_wider(names_from = Year,values_from = Age)

Filter the na's rows out of columns 2013,2014,2015,2016

Pivot back

df %>% pivot_longer(2013:2016)

answered May 15 '20 at 12:27

Bruno

4,109
1
9
27

1

Thank you Bruno. Jason's code above worked and because I have almost 1M rows (and 40 columns), I don't really want to take the risk of trying your code as well... Im quite new to R so mistakes are happening a lot, sorry! Thank you for your time and help :) – Student0172 May 15 '20 at 12:39

score 2 · Accepted Answer · answered May 15 '20 at 13:06

2

Just to capture @Jasonaizkains answer from the comments field above, since pivoting is not strictly necessary in this case with some play data.

library(dplyr)
id <- rep(10:13, 4) # four subjects
year <- rep(2013:2016, each = 4) # four years
gender <- sample(1:2, 16, replace = TRUE)
play <- tibble(id, gender, year) # data.frame of 16

play <- play[-9,] # removes row for id 10 in 2015

# Removes all entries for the right id number
play %>% group_by(id) %>% filter(n_distinct(year) >= 4) %>% ungroup()
#> # A tibble: 12 x 3
#>       id gender  year
#>    <int>  <int> <int>
#>  1    11      1  2013
#>  2    12      2  2013
#>  3    13      2  2013
#>  4    11      1  2014
#>  5    12      2  2014
#>  6    13      1  2014
#>  7    11      2  2015
#>  8    12      2  2015
#>  9    13      2  2015
#> 10    11      2  2016
#> 11    12      2  2016
#> 12    13      1  2016

answered May 15 '20 at 13:06

Chuck P

3,862
3
9
20

Hi Chuck, thank you for doing that! I just found out the code 'missed' a couple of years/rows/ID's. I still have 174 more rows than I should have based on lenght(unique(df$ID) * 4 rows.. Any idea how? – Student0172 May 15 '20 at 15:18
Any chance an ID has more than 4? Other simple possibility, you are remembering to put the results somewhere either back in `play` or in say `play2` – Chuck P May 15 '20 at 15:24
You can generate output in descending order with this and see what pops to the top or bottom `play %>% group_by(id) %>% summarise(id_count = n_distinct(year)) %>% arrange(id_count)` (low numbers) or `> play %>% group_by(id) %>% summarise(id_count = n_distinct(year)) %>% arrange(desc(id_count))` (high numbers) – Chuck P May 15 '20 at 15:32
Hi Chuck, probably not. I've checked with the table function, and saw that after your first code, that the amount of unique ID's per year were different. For 2013, there were 156.979 ID's, for 2014 156.976, 2015 156982 and for 2016 156.985. I've also checked with your code for 1, 2, 3, 5 or more years, and then my tables were all empty for 5 or more years. – Student0172 May 15 '20 at 17:14
Instead of the second 'table' I mean dataframe. Sorry! – Student0172 May 15 '20 at 17:29
Hmmm hard to understand without direct access, did you try this code? `play %>% group_by(id) %>% summarise(id_count = n_distinct(year)) %>% arrange(id_count)` – Chuck P May 15 '20 at 17:42

How to select (four) specific rows (multiple times) based on a column value in R?

2 Answers2