2

I have a matrix of elements containing years, traits and species. But I want to select only certain species that are present a certain consecutive number of year. I can see this by looking at a table:

mat = matrix(c(2000,2001,2001,2002,2002,2003, 2004, 2005,
               2001,
               2000, 2001, 2002,  2005,
               2000, 2002, 2004, 2004,  2006,
               2,3,3,4,1,2,2,2,
               2,
               1,4,4,3,
               1,4,4,3,2,
               "sp1","sp1","sp1","sp1","sp1","sp1","sp1","sp1",
               "sp2",
               "sp3","sp3","sp3","sp3",
               "sp4","sp4","sp4","sp4","sp4"), nrow = 18)
mat = as.data.frame(mat)
colnames(mat) = c("yr","trait","sp")
res = table(mat$sp,mat$yr)

Here the table looks like this:

      2000 2001 2002 2003 2004 2005 2006
  sp1    1    2    2    1    1    1    0
  sp2    0    1    0    0    0    0    0
  sp3    1    1    1    0    0    1    0
  sp4    1    0    1    0    2    0    1
  1. But here, I want to remove sp2 from my analysis since it was seen only once in 2001 and no other years. Is there a way to do this? I've tried this, but it prints the exact same table:

    res[apply(res,1,function(z) any(z==0)),]
    

    In the end I'd like to delete sp2 from the ´mat´ data, but using the information in the table to delete sp2.

          2000 2001 2002 2003 2004 2005 2006
      sp1    1    2    2    1    1    1    0
      sp3    1    1    1    0    0    1    0
      sp4    1    0    1    0    2    0    1
    

    And the ´mat´ would look like:

         yr trait  sp
    1  2000     2 sp1
    2  2001     3 sp1
    3  2001     3 sp1
    4  2002     4 sp1
    5  2002     1 sp1
    6  2003     2 sp1
    7  2004     2 sp1
    8  2005     2 sp1
    10 2000     1 sp3
    11 2001     4 sp3
    12 2002     4 sp3
    13 2005     3 sp3
    14 2000     1 sp4
    15 2002     4 sp4
    16 2004     4 sp4
    17 2004     3 sp4
    18 2006     2 sp4
    
  2. Also, I want a second command that would allow me to select the individuals from ´mat´ that are seen consecutively in 2 or more years (this would delete sp4, since it was seen at even years only).

    Again, I've tried this, but it's not deleting the right information:

    mat[which(res != 0),]
    

    The end result would be:

          2000 2001 2002 2003 2004 2005 2006
      sp1    1    2    2    1    1    1    0
      sp3    1    1    1    0    0    1    0
    

    And the ´mat´ would look like:

         yr trait  sp
    1  2000     2 sp1
    2  2001     3 sp1
    3  2001     3 sp1
    4  2002     4 sp1
    5  2002     1 sp1
    6  2003     2 sp1
    7  2004     2 sp1
    8  2005     2 sp1
    10 2000     1 sp3
    11 2001     4 sp3
    12 2002     4 sp3
    13 2005     3 sp3
    

This would be applied to a much larger dataset. This is just a small example.

Frank
  • 66,179
  • 8
  • 96
  • 180
M. Beausoleil
  • 3,141
  • 6
  • 29
  • 61
  • So, `res` isn't necessarily the ideal format. There's no point in making a wide dataset when you've already got a nice data.frame, and you can use `dplyr`, `data.table` or split-lapply-combine to group/filter your data by sp. – Shape May 17 '17 at 15:18
  • For example, this is the (base), split-lapply-combine version: `do.call(rbind, lapply(split(mat, mat$sp), function(x) if(max(x$yr) < 2002) NULL else x))` where I've excluded everything that only appears before 2002 – Shape May 17 '17 at 15:23
  • I don't understand why you'd want to base that on a specific year. I'm more thinking generally here, in terms of "number of non-0 values" (in this case I want to delete the lines that contains only 1 non-0 values) or "more than X consecutive numbers" where X could be 3 in this case. – M. Beausoleil May 17 '17 at 15:30
  • `res[which(rowSums(1*(res!=0))>1),]` should give you a solution to your first problem – Michael Kirchner May 17 '17 at 15:32

3 Answers3

2

I don't think you need the res table to perform the filtering on your dataframe mat, you can do it directly using dplyr. In order to filter out a given sp that appears only one year, you can do:

library(dplyr)
mat %>% group_by(yr) %>% group_by(sp) %>% filter(n_distinct(yr)>1) %>% ungroup()

  yr  trait     sp
   <fctr> <fctr> <fctr>
1    2000      2    sp1
2    2001      3    sp1
3    2001      3    sp1
4    2002      4    sp1
5    2002      1    sp1
6    2003      2    sp1
7    2004      2    sp1
8    2005      2    sp1
9    2000      1    sp3
10   2001      4    sp3
11   2002      4    sp3
12   2005      3    sp3
13   2000      1    sp4
14   2002      4    sp4
15   2004      4    sp4
16   2004      3    sp4
17   2006      2    sp4

In order to filter out the sps that don't appear two consecutive years, you can do:

mat %>% group_by(sp)%>% filter(min(diff(sort(unique(yr))))==1)

This returns

yr  trait    sp
   <dbl> <fctr> <chr>
1      1      2   sp1
2      2      3   sp1
3      2      3   sp1
4      3      4   sp1
5      3      1   sp1
6      4      2   sp1
7      5      2   sp1
8      6      2   sp1
9      1      1   sp3
10     2      4   sp3
11     3      4   sp3
12     6      3   sp3

Note that this last operation returns a warning as sp2 only has one year. You can combine the two above operations:

mat %>% group_by(yr) %>% group_by(sp) %>% filter(n_distinct(yr)>1) %>% ungroup() %>% group_by(sp)%>% filter(min(diff(sort(unique(yr))))==1)

Which doesn't return the warning.

EDIT: In case you want to filter according to a specified number of consecutive years (not just 2), you could do:

## This function returns the max number of consecutive 1s +1 in a vector, and 0 if there are none or there is just one value in the vector
consec1=function(x){ifelse((1 %in% x),max(rle(x)$lengths[rle(x)$values==1])+1,0)}
## Then use it in your dplyr::filter
mat %>% group_by(sp) %>% filter(consec1(diff(sort(unique(yr))))==6)

Which returns:

 yr  trait     sp
  <dbl> <fctr> <fctr>
1  2000      2    sp1
2  2001      3    sp1
3  2001      3    sp1
4  2002      4    sp1
5  2002      1    sp1
6  2003      2    sp1
7  2004      2    sp1
8  2005      2    sp1
Lamia
  • 3,845
  • 1
  • 12
  • 19
  • Very nice! I wanted to change the code to have something that would apply for different filter e.g. `mat %>% group_by(sp) %>% filter(min(diff(sort(unique(yr))))==2)` to select the species that were present in at least 2 consecutive years. The odd part is that I'm getting as output only `sp4` which is never having 2 *consecutive years*. Do you know how I could do that? Say I want a threshold of 4 consecutive years. If I add `==4` it gives me nothing: `Source: local data frame [0 x 3] Groups: sp [0] # ... with 3 variables: yr , trait , sp ` – M. Beausoleil May 17 '17 at 19:54
  • 1
    `mat %>% group_by(sp) %>% filter(min(diff(sort(unique(yr))))==1)` keeps only the sp's where there are at least 2 consecutive years, and thus the minimum of the difference of the vector of unique years is 1. For example, `diff(c(2012,2013,2015))` returns 1 and 2. In the case of sp4, the minimum of the difference of consecutive years is 2, hence your result. – Lamia May 17 '17 at 20:02
  • I think I got it with your help! `mat %>% group_by(sp)%>% filter(sum(diff(sort(unique(yr))) ==1)>=2)` Does that make sense? I'm summing all the "difference" of 1 and looking at the ones that are more or equal to 2 (meaning that I want 2 consecutive years) – M. Beausoleil May 17 '17 at 20:15
  • 1
    As you guessed, this method doesn't extend to cases where you want to specify the number of consecutive years. I'll edit the answer to include a more general method. – Lamia May 17 '17 at 20:19
  • Well now it's working perfectly for me! Very useful command. I've tested it on a 712443 lines dataset and it's super fast (maybe 2-3 seconds)! – M. Beausoleil May 17 '17 at 20:21
  • Actually, if the difference between successive years is 1,1,2,1 (for ex. 2006,2007,2008,2010,2011) or 1,1,1,2 (for ex. 2006,2007,2008,2009,2011) your method would return 3 in both cases, whereas there are only 4 consecutive years in the second case. Check the method I've included in the edir, it tests for consecutive years using `rle`. – Lamia May 17 '17 at 21:04
1

Replicating your data:

mat = matrix(c(2000,2001,2001,2002,2002,2003, 2004, 2005,
               2001,
               2000, 2001, 2002,  2005,
               2000, 2002, 2004, 2004,  2006,
               2,3,3,4,1,2,2,2,
               2,
               1,4,4,3,
               1,4,4,3,2,
               "sp1","sp1","sp1","sp1","sp1","sp1","sp1","sp1",
               "sp2",
               "sp3","sp3","sp3","sp3",
               "sp4","sp4","sp4","sp4","sp4"), nrow = 18)
mat = as.data.frame(mat)
colnames(mat) = c("yr","trait","sp")
res = table(mat$sp,mat$yr)

Question One:

Multiplying the boolean matrix by 1 will give you something that you can use rowSums() on.

res <- res[which(rowSums(1*(res!=0))>1),]
res

Will give you:

      2000 2001 2002 2003 2004 2005 2006
  sp1    1    2    2    1    1    1    0
  sp3    1    1    1    0    0    1    0
  sp4    1    0    1    0    2    0    1

Question Two:

You can use rle() to detect run lengths.

res <- res[apply(res, 1, function(x) any(rle(x)$lengths > 1)),]
res

Will give you:

      2000 2001 2002 2003 2004 2005 2006
  sp1    1    2    2    1    1    1    0
  sp3    1    1    1    0    0    1    0
Michael Kirchner
  • 869
  • 1
  • 7
  • 17
  • If you need a more efficient method for question two, check: http://stackoverflow.com/questions/5012516/count-how-many-consecutive-values-are-true – Michael Kirchner May 17 '17 at 15:45
0
  1. I am a bit reluctant but:

     find_zeros<-function(vec){
        bool<-grepl("1{3}",paste(ifelse(vec==0,1,0),collapse = ""),perl = T)  
        return(bool)
     }
    

res[!apply(res,1,find_zeros),]

yielding in the:

       2000 2001 2002 2003 2004 2005 2006
   sp1    1    2    2    1    1    1    0
   sp3    1    1    1    0    0    1    0
   sp4    1    0    1    0    2    0    1

Of course in order to get the trimmed mat one should use:

    mat_trimmed<-mat[(mat$sp %in% row.names(final)),]
amonk
  • 1,769
  • 2
  • 18
  • 27