Subsetting dataframe by multiple row and column matches using dplyr

Question

I've searched for an answer to this question for awhile and haven't found this exact test case. Please accept my apologies if it's been solved elsewhere.

I have a large dataframe data of biological observations (~2,500,000 rows, ~50 columns) from a government survey program. The important columns are reg (region), species, and year (of the survey).

I would like to subset the dataframe to contain only the species that were found at least once per year, in each region. (I also need to do this for species found at least 10 times per year, but I assume the solution is the same.)

In other words, I need to filter out only the rows for which a species is present in every year that a region is also present - essentially, all the rows where unique(year) for each species matches unique(year) for each reg. (There are only nine regions, so I could certainly subset the data into nine dataframes and repeat this filtering nine times over, but I'm hoping there is a more elegant solution using only dplyr that I can apply directly to data.)

Thank you!

EDIT: I created a mock dataset as suggested by @aichao.

> year = c(1999,2000,2000,2004,2004,2008)
> reg = c('ai','ci','ci','ai','ai','ci')
> species = c('blueei','greenei','yellowei','blueei','greenei','yellowei')
> df <- data.frame(year, reg, species)
> df
  year reg  species
1 1999  ai   blueei
2 2000  ci  greenei
3 2000  ci yellowei
4 2004  ai   blueei
5 2004  ai  greenei
6 2008  ci yellowei

So in this df, I want to keep only the species that were found every time the region was surveyed (let's pretend this is the complete data). From ai, I want to keep blueei - which was found in every survey year - but not greenei. Similarly, in ci, I want to keep yellowei but not greenei, since greenei did not turn up in 2008.

please submit a minimal exampe with `dput(data)` or a subset of the data frame — J_F, Sep 22 '16 at 22:04
@Alexa, welcome to SO. It will help ppl to answer you question if they have a sample of your data - http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Bulat, Sep 22 '16 at 22:05

aichao · Answer 1 · 2016-09-23T18:48:25.263

1

to keep only the species that were found every time the region was surveyed

We can do the following using dplyr:

library(dplyr)
result <- df %>% group_by(reg) %>% mutate(num.years.in.reg = length(unique(year))) %>% 
                 group_by(reg,species) %>% filter(length(unique(year)) == first(num.years.in.reg)) %>%
                 select(-num.years.in.reg) %>% arrange(reg)

Notes:

First, group_by reg and create a column num.years.in.reg with the number of unique years that region is surveyed.
Then, group_by both reg and species and keep only those species for which the number of unique years that the species is surveyed in the region is equal to the number of unique years that region is surveyed. Here, first just retrieves the first value from num.years.in.reg since they will be the same for all rows in the group.
Remove the num.years.in.reg column and sort the result by reg.
Note that in the filtering logic, we do not need to compare the unique year values, just the number of unique years, because the second grouping is a subset of the first grouping. That is, the unique years will be the same but there may be less of them in the second grouping than the first.

With the data you posted, the result is:

print(result)
##Source: local data frame [4 x 3]
##Groups: reg, species [2]
##
##   year   reg  species
##  <dbl> <chr>    <chr>
##1  1999    ai   blueei
##2  2004    ai   blueei
##3  2000    ci yellowei
##4  2008    ci yellowei

edited Sep 23 '16 at 18:48

answered Sep 22 '16 at 22:32

aichao

7,375
3
16
18

Thanks so much for the help, @aichao. Sorry that I didn't post the data in a more helpful format. I'm trying to identify species that exist in ALL years that a region was sampled. So if a region was sampled in 12 years, I want to subset only the species that were also sampled in each of those 12 years. This varies by region, though - they were all sampled differently. I'm not sure what `filter(n()>1)` does - can it accomplish that? – AFH Sep 22 '16 at 22:50
@Alexa: sorry for the delayed response. By grouping over `year`, `reg`, and `species`, we divide the data so that each group we has a unique triplet of (`year`, `reg`, `species`) values, for example, (`2010`,`ai`,`greenei`). Now, each group may have only one row of data or may have more than one row. All `filter(n()>1)` does is to keep only those groups for which there are more than one row in the group. This was my interpretation of what you want, but I may be wrong. Please let me know. – aichao Sep 23 '16 at 12:34
Oh, now I get it. Thank you! This is not quite what I was going for. In your example data frame above, I would actually want to keep all of those rows. I'm trying to keep all the species that were found _in every year of the regional survey_, however many years the region has been surveyed. So if `ai` was only surveyed in 2010, I'd keep all the species found in 2010. If `ci` is surveyed in 2010, 2011, and 2012, I want to keep _only_ the species found in each survey year - 2010, 2011, and 2012. I'll re-edit the original question with another mock df that is more helpful. – AFH Sep 23 '16 at 15:43
@Alexa: please see my edit and let me know if it now does what you want. Sorry for the delay in understanding. Thanks! – aichao Sep 23 '16 at 17:32
Thank you so much! This is REALLY close. When I run the code as is, though, it drops all of the 2.5 million rows except 649 rows, which only contain 41 species. I think this is because of `filter(n() == first(num.years.in.reg))`, which "keeps only those species for which the number of rows in the group, found by n(), is equal to the number of unique years that region is surveyed". However, in this particular dataset, there are dozens or hundreds of surveys (trawl tows) in a year, so most species will have many more rows in the group than number of unique years the region is surveyed. – AFH Sep 23 '16 at 18:16
@Alexa: please see my new edit. I believe that I'm finally zeroing in on the filtering logic, which you did specify in your original post. Thanks for your patience. – aichao Sep 23 '16 at 18:41
@Alexa: I realized that you already accepted my answer, but I was wondering if it actually worked for your 2.5 million rows! If so, I think I know an extension to do this for species found at least 10 times per year. Please let me know. Thanks! – aichao Sep 23 '16 at 20:48
So ... @tclavelle (also at UCSB) and I combed through this last week and realized that (1) his solution was right, but (2) the dataset I'm using was already filtered this way, so it didn't remove any species (stupid mistake on my part). In retrospect, I think there was still some error with your solution, but either way we figured it out. I am still working on the 10x/year filter with the code below. Thank you for all the help! – AFH Sep 26 '16 at 22:40

score 1 · Accepted Answer · answered Sep 23 '16 at 20:31

You can create an id variable in your original data frame representing the region + species pair.

 year = c(1999,2000,2000,2004,2004,2008)
 reg = c('ai','ci','ci','ai','ai','ci')
 species = c('blueei','greenei','yellowei','blueei','greenei','yellowei')

df <- data.frame(year, reg, species) %>%
 mutate(ids = paste(reg, species, sep='-'))

Then create a new data frame that identifies, for each region, the total number of years, how many years each species was present, and which species were present in all years.

 df2 <- df %>%
 group_by(reg) %>%
 mutate(n_yrs = length(unique(year))) %>%
 group_by(reg, ids) %>%
 summarize(present_yrs = length(unique(year)),
         all_yrs     = mean(n_yrs, na.rm = T)) %>%
 filter(present_yrs == all_yrs)

Lastly, filter your old dataset to only include the id's in df2

filter(df, ids %in% df2$ids)

Subsetting dataframe by multiple row and column matches using dplyr

2 Answers2