Remove rows in data set based on multiple criteria

Question

My data set contains animal ID, date, year, month, and day. I need to remove all animal IDs that have less than 40 locations (in this case 40 rows in R) in a given year. In other words, animal ID = 1 has 20 locations in 2001; therefore, remove this individual from the data set. I then need to calculate how many months worth of data there is for the remaining set of records. In other words, I need to have >= 40 locations per animal ID per year spanned across at least 6 months. Example: Animal ID 2 had > 40 rows of data in 2001 so it met the first criteria mentioned above but those 40 rows of data in 2001 only span 3 months; therefore, this individual needs to be removed from the data set. I can't seem to figure out a quick way in R to essentially subset my data set to address the 2 aforementioned questions.

Initial coding I've started working on:

newdata<-data[as.character(ave(data$Animal_ID, data$Animal_ID, FUN=length)) >= 40, ]

But I know this isn't exactly correct.

Sample data set

dput(dataset)
structure(list(Animal_ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L), Date = structure(c(1L, 
2L, 39L, 46L, 43L, 53L, 55L, 57L, 62L, 72L, 77L, 77L, 78L, 79L, 
80L, 81L, 81L, 81L, 82L, 83L, 84L, 84L, 84L, 85L, 86L, 87L, 87L, 
88L, 92L, 102L, 102L, 103L, 104L, 104L, 104L, 104L, 104L, 104L, 
104L, 104L, 104L, 105L, 89L, 89L, 90L, 90L, 90L, 91L, 93L, 93L, 
94L, 95L, 96L, 96L, 97L, 97L, 98L, 98L, 98L, 98L, 98L, 98L, 98L, 
98L, 99L, 100L, 117L, 118L, 120L, 106L, 108L, 109L, 111L, 115L, 
116L, 3L, 3L, 8L, 13L, 15L, 16L, 17L, 18L, 19L, 4L, 45L, 47L, 
51L, 48L, 52L, 52L, 61L, 63L, 63L, 64L, 54L, 56L, 58L, 58L, 59L, 
60L, 60L, 60L, 71L, 73L, 74L, 75L, 76L, 76L, 65L, 66L, 66L, 67L, 
68L, 69L, 70L, 40L, 41L, 42L, 44L, 45L, 47L, 49L, 49L, 49L, 49L, 
49L, 49L, 49L, 49L, 50L, 50L, 51L, 89L, 90L, 91L, 93L, 94L, 94L, 
94L, 94L, 94L, 94L, 94L, 96L, 97L, 99L, 100L, 100L, 101L, 117L, 
118L, 118L, 119L, 120L, 121L, 106L, 107L, 107L, 108L, 109L, 110L, 
111L, 112L, 113L, 114L, 114L, 115L, 115L, 116L, 3L, 3L, 8L, 13L, 
17L, 18L, 18L, 19L, 4L, 5L, 5L, 6L, 7L, 9L, 9L, 10L, 11L, 12L, 
14L, 14L, 26L, 27L, 28L, 29L, 30L, 20L, 20L, 21L, 21L, 22L, 23L, 
24L, 25L, 34L, 35L, 37L, 38L, 31L, 32L, 33L, 36L), .Label = c("1/23/2001", 
"1/30/2001", "10/1/2002", "10/10/2002", "10/14/2002", "10/17/2002", 
"10/18/2002", "10/2/2002", "10/21/2002", "10/23/2002", "10/25/2002", 
"10/28/2002", "10/3/2002", "10/30/2002", "10/4/2002", "10/6/2002", 
"10/7/2002", "10/8/2002", "10/9/2002", "11/12/2002", "11/13/2002", 
"11/15/2002", "11/21/2002", "11/25/2002", "11/27/2002", "11/4/2002", 
"11/5/2002", "11/6/2002", "11/7/2002", "11/8/2002", "12/11/2002", 
"12/13/2002", "12/17/2002", "12/2/2002", "12/3/2002", "12/30/2002", 
"12/6/2002", "12/9/2002", "2/21/2001", "3/11/2002", "3/13/2002", 
"3/22/2002", "3/23/2001", "3/23/2002", "3/25/2002", "3/8/2001", 
"4/1/2002", "4/10/2002", "4/2/2002", "4/5/2002", "4/7/2002", 
"5/1/2002", "5/13/2001", "5/14/2002", "5/15/2001", "5/15/2002", 
"5/17/2001", "5/20/2002", "5/28/2002", "5/29/2002", "5/3/2002", 
"5/30/2001", "5/8/2002", "5/9/2002", "6/10/2002", "6/12/2002", 
"6/13/2002", "6/17/2002", "6/19/2002", "6/20/2002", "6/3/2002", 
"6/4/2001", "6/4/2002", "6/5/2002", "6/6/2002", "6/7/2002", "7/11/2002", 
"7/12/2002", "7/15/2002", "7/16/2002", "7/17/2002", "7/18/2002", 
"7/24/2002", "7/25/2002", "7/27/2002", "7/29/2002", "7/31/2002", 
"8/1/2002", "8/12/2002", "8/14/2002", "8/19/2002", "8/2/2002", 
"8/20/2002", "8/21/2002", "8/22/2002", "8/23/2002", "8/26/2002", 
"8/27/2002", "8/28/2002", "8/29/2002", "8/30/2002", "8/5/2002", 
"8/7/2002", "8/8/2002", "8/9/2002", "9/10/2002", "9/11/2002", 
"9/13/2002", "9/16/2002", "9/17/2002", "9/18/2002", "9/19/2002", 
"9/20/2002", "9/23/2002", "9/25/2002", "9/26/2002", "9/3/2002", 
"9/4/2002", "9/5/2002", "9/6/2002", "9/9/2002"), class = "factor"), 
    Year = c(2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 
    2001L, 2001L, 2001L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L), Month = c(1L, 
    1L, 2L, 3L, 3L, 5L, 5L, 5L, 5L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 
    7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 
    8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 
    8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 
    8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L, 
    10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 3L, 4L, 4L, 
    4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
    6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 3L, 3L, 
    3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
    8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 
    8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 
    9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 10L, 10L, 
    10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
    10L, 10L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 
    11L, 11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L), Day = c(23L, 
    30L, 21L, 8L, 23L, 13L, 15L, 17L, 30L, 4L, 11L, 11L, 12L, 
    15L, 16L, 17L, 17L, 17L, 18L, 24L, 25L, 25L, 25L, 27L, 29L, 
    31L, 31L, 1L, 2L, 5L, 5L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 
    8L, 8L, 9L, 12L, 12L, 14L, 14L, 14L, 19L, 20L, 20L, 21L, 
    22L, 23L, 23L, 26L, 26L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 
    27L, 28L, 29L, 3L, 4L, 6L, 10L, 13L, 16L, 18L, 25L, 26L, 
    1L, 1L, 2L, 3L, 4L, 6L, 7L, 8L, 9L, 10L, 25L, 1L, 7L, 10L, 
    1L, 1L, 3L, 8L, 8L, 9L, 14L, 15L, 20L, 20L, 28L, 29L, 29L, 
    29L, 3L, 4L, 5L, 6L, 7L, 7L, 10L, 12L, 12L, 13L, 17L, 19L, 
    20L, 11L, 13L, 22L, 23L, 25L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 5L, 5L, 7L, 12L, 14L, 19L, 20L, 21L, 21L, 21L, 21L, 
    21L, 21L, 21L, 23L, 26L, 28L, 29L, 29L, 30L, 3L, 4L, 4L, 
    5L, 6L, 9L, 10L, 11L, 11L, 13L, 16L, 17L, 18L, 19L, 20L, 
    23L, 23L, 25L, 25L, 26L, 1L, 1L, 2L, 3L, 7L, 8L, 8L, 9L, 
    10L, 14L, 14L, 17L, 18L, 21L, 21L, 23L, 25L, 28L, 30L, 30L, 
    4L, 5L, 6L, 7L, 8L, 12L, 12L, 13L, 13L, 15L, 21L, 25L, 27L, 
    2L, 3L, 6L, 9L, 11L, 13L, 17L, 30L)), .Names = c("Animal_ID", 
"Date", "Year", "Month", "Day"), class = "data.frame", row.names = c(NA, 
-211L))

In order for us to be able to answer your question, please include a sample of your data by typing `dput(variableName)` and copying and pasting the console output into your question. For more information on how to make a reproducible example in `R` (and make it more likely your question is answered) please view [this post](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). — Barker, Apr 14 '17 at 17:02
(I recommend against using `data` or any other existing function as a variable name. It works just fine until you restart R and forget to assign your variables, at which point the error messages become a little obscure. See "side note" in my answer here: http://stackoverflow.com/a/43398262/3358272) — r2evans, Apr 14 '17 at 17:06
I have too many rows for Animal_ID to provide the dput structure for. In total, I have over 18,000 rows of data with some animals having >300 rows of data under one Animal_ID. I've shown the structure of the data above and added a sample of the data under the "Sample data set" link below my question. — Buck2079, Apr 14 '17 at 18:20
You can create a sample data that reflect your problem, not necessarily the real data or complete data. I thought that link is some download, then found it is a screenshot. You are asking people to create sample data for you in their solutions. Otherwise how can you test the code without data? — dracodoc, Apr 14 '17 at 19:03
To be honest, I'm new to R coding so I'm unsure how to create a sample data set based on the structure of my sample above. In other words, I'm not sure how to create multiple records per animal ID with different dates, years, and months. I can bring in another sample set for viewers. — Buck2079, Apr 15 '17 at 13:02

score 0 · Answer 1 · answered Apr 14 '17 at 18:29

0

One way to do it is to table the values in the ID column. Then, step through the table values and remove any lines that meet the criterion.

I've made up some data:

df = data.frame(ID = c(rep('otter',5),rep('beaver',3),rep('muskrat',4)),
      locations=sample(1:12))
# create the table
table.ID= table(df$ID)
for (i in 1:length(table.ID)) {
    # if the number of occurrences matches the criterion  
    if (table.ID[i] > 4) {
      # remove those rows by finding out which rows have ID
      # values that match the tabled name
      df = df[ -which(df$ID==names(table.ID)[i]), ]
    }
}

Any ID that occurs in more than 4 rows (locations) will be deleted.

answered Apr 14 '17 at 18:29

Edward Carney

1,372
9
7

Given my data structure above, would the R code look something like the following: – Buck2079 Apr 14 '17 at 19:04
Given my data structure above, would the R code look something like the following: # create the table table.ID= table(bobdata$Animal_ID) for (i in 1:length(table.ID)) { # if the number of occurrences matches the criterion if (table.ID[i] >= 40) { # remove those rows by finding out which rows have ID # values that match the tabled name df = df[ -which(bobdata$Animal_ID==names(table.ID)[i]), ] } } – Buck2079 Apr 14 '17 at 20:34
Is the df in the last line of code referring to the earlier created dataframe? Also, what is the "names' command doing? Thanks! – Buck2079 Apr 14 '17 at 20:35
Yes. You would use `bobdata = bobdata[-which(bobdata$Animal_ID==names(table.ID[i],]` to put the data back into the same data frame. You could also assign it to `bobdata.2` or something similar to preserve the original data frame. – Edward Carney Apr 14 '17 at 20:40
The `names` function returns the names of an object. In this case the `table.ID` object has names corresponding to the different animals and a count of how many of each name was found in the Animal_ID column. In your case, the names will be the character form of the numerals coding each animal's ID. `names(table.ID)` yields `[1] "1" "2" "3"` – Edward Carney Apr 14 '17 at 20:45
Thank you for addressing my questions! – Buck2079 Apr 14 '17 at 20:52

M_Shimal · Accepted Answer · 2017-04-15T15:22:43.973

0

You can do that easily in dplyr package. Assuming the name of dataset is animal_data here is how I would run this.

UPDATED - I admit I was careless before and had made a big mistake. But the following new code set would let you achieve your intended outcome, though I am sure it can be still improved.

library(dplyr)

animal_data_by_n <- new_data %>% 
  group_by(Animal_ID, Year) %>% 
  filter(n() >= 40) # Only selecting animals that have records greater than 40 records for a given year

animal_data_by_n_Month <- animal_data_by_n %>% 
  group_by(Animal_ID, Year) %>%  
  summarise(n_Month = n_distinct(Month))

new_output <- merge(animal_data_by_n, animal_data_by_n_Month, by=c("Animal_ID","Year"), all.x=TRUE)
Final_subset <- subset(new_output, n_Month >= 6)

You may remove the n_month column later from your final dataframe

edited Apr 15 '17 at 15:22

answered Apr 14 '17 at 20:07

M_Shimal

413
3
12

Thanks for that. I knew that there'd be a way to do it with `dplyr`, but I didn't have the time to pursue it. – Edward Carney Apr 14 '17 at 20:19
@M_Shimal...I ran a table function to make sure it worked correctly and found some individual animals had less than 6 months worth of data in a given year. In other words, I found examples like below that had >= 40 locations in a given year for an animal but those 40 locations did not span across >= 6 months. Thanks again for the help! Animal ID = 100 Year = 2006 Months = 0 10 12 16 7 0 0 0 0 0 0 0 – Buck2079 Apr 14 '17 at 20:51
@Buck2079 So you mean the code returned records that do not match the criteria? – M_Shimal Apr 14 '17 at 20:59
Yes. See example output in my previous comment. I found several cases like this that had >40 records per year per ID but the 40 records were only spread across 3-5 months rather than 6+ months. – Buck2079 Apr 14 '17 at 23:02
Sorry, can't work on this further without having a look at your dataset. Maybe you should consider `dput` data to show here. – M_Shimal Apr 15 '17 at 00:47
@M_Shimal....I tried the dput(VariableName) as you suggested but was unable to get a small enough sample to add to this site. With the number of rows in my data set, I would have over 100s of rows for the dput structure to add to this site, which cannot be done. If you know another way around that then let me know. – Buck2079 Apr 15 '17 at 13:05
dput(head(dataset_name,number of rows)) - you can get small enough sample set of your data by doing this – M_Shimal Apr 15 '17 at 13:18
@M_Shimal...thank you for the revised coding. It worked perfectly this time! Thanks again! – Buck2079 Apr 17 '17 at 17:46

Remove rows in data set based on multiple criteria

2 Answers2