Create loop that iteratively aggregates data for sequentially larger cluster sizes

Question

First, the data begin as a data frame, df. I have converted df to a species-by-plot matrix, mat (figuring it will be easier to work from this format). Species are rows and plots are columns. Cells represent the frequency with which the species was found in that plot.

set.seed(3421)
df<-data.frame(plot= as.factor(c(rep(1,4),rep(2,4),rep(3,3),rep(4,2),
                   rep(5,6),rep(6,7))),
           species= sample(letters[1:26], size= 26, replace=TRUE))

library("tidyverse")
library("reshape2")  # dcast() comes from reshape2, not the tidyverse
df<- 
  df%>%
  group_by(plot, species)%>%
  summarize(freq= length(species))
mat<- dcast(df , species~plot, value.var = "freq", fill=0 )
mat<- matrix(c(1,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,
           2,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,
           0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,
           0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
           0,0,0,0,1,0,0,1,0,1,1,0,0,1,0,1,
           1,1,0,0,0,0,0,0,1,0,1,2,1,0,0,0), nrow=16, ncol=6)
dimnames(mat)<- list(c("a", "c", "f", "h", "i", "j", "l", "m", "p", "q", "s", "t", "u", "v","x", "z"), 
                  c("1", "2", "3", "4", "5", "6"))

I would like to create a loop that iterates through df to create a list of matrices, one per cluster size, where each matrix holds the unique aggregates of plots for that cluster size. Given my example data frame, cluster sizes can range from 1 plot to all 6 plots combined. For cluster size = 1, a single plot is its own cluster, so the results are simply the frequency of each species in that plot. For cluster size = 2, a cluster is defined as the aggregation of two plots, and the results are the sums of frequencies for each species across TWO aggregated plots. Similarly, for cluster size = 3, a cluster is defined as the aggregation of THREE plots, and the results are the sums of frequencies for each species across THREE aggregated plots.

For n cluster sizes, plots can be aggregated i times to achieve a cluster of that size. For example, in a cluster size of 2 we may aggregate: plot 1 & plot 2, plot 2 & plot 3 AND plot 5 & plot 10.

I wish to cluster using a moving window method. So, for a cluster size of 2, adjacent plots would be aggregated as follows: 1&2, 2&3, 3&4, 4&5, and so on through the last adjacent pair (5&6 in the example data).
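A minimal sketch of the moving-window aggregation described above, using base R only. The helper name window_sums and the toy matrix are hypothetical and not part of the original data; the sketch assumes plots are already ordered by column and that k adjacent columns form one cluster.

```r
# Hypothetical helper (not from the original post): sum every run of k
# adjacent columns of a species-by-plot matrix.
window_sums <- function(m, k) {
  stopifnot(k >= 1, k <= ncol(m))
  starts <- seq_len(ncol(m) - k + 1)   # first plot of each window
  # One column of species totals per window; vapply always returns a matrix here
  out <- vapply(starts,
                function(s) rowSums(m[, s:(s + k - 1), drop = FALSE]),
                numeric(nrow(m)))
  # Label each window with the plots it combines, e.g. "1_2"
  ids <- vapply(starts,
                function(s) paste(colnames(m)[s:(s + k - 1)], collapse = "_"),
                character(1))
  dimnames(out) <- list(rownames(m), ids)
  out
}

# Toy data: 3 species (rows) by 4 plots (columns), filled column by column
toy <- matrix(c(1, 0, 2,
                2, 1, 0,
                0, 1, 1,
                0, 0, 3),
              nrow = 3,
              dimnames = list(c("a", "c", "f"), c("1", "2", "3", "4")))

toy_k2 <- window_sums(toy, 2)  # windows 1&2, 2&3, 3&4
print(toy_k2)
```

With k = 1 the helper returns the input matrix unchanged, and with k = ncol(m) it returns a single column of overall species totals.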

I imagine the way to go about this is to loop through the original data frame or matrix and output a new matrix for each cluster size. Below I provide examples of output matrices for cluster sizes 1-3 for the example data frame above.

Example output matrices for cluster size 1: Aggregates of 1 plot

mat1<- matrix(c(1,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0), nrow=16, ncol=1)
dimnames(mat1)<- list(c("a", "c", "f", "h", "i", "j", "l", "m", "p", "q", "s", "t", "u", "v","x", "z"), 
                  c("1"))
mat2<- matrix(c(2,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0), nrow=16)
dimnames(mat2)<- list(c("a", "c", "f", "h", "i", "j", "l", "m", "p", "q", "s", "t", "u", "v","x", "z"), 
                  c("2"))
mat3<- matrix(c(0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0), nrow=16)
dimnames(mat3)<- list(c("a", "c", "f", "h", "i", "j", "l", "m", "p", "q", "s", "t", "u", "v","x", "z"), 
                  c("3"))
mat4<- matrix(c(0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0), nrow=16)
dimnames(mat4)<- list(c("a", "c", "f", "h", "i", "j", "l", "m", "p", "q", "s", "t", "u", "v","x", "z"), 
                  c("4"))
mat5<- matrix(c(0,0,0,0,1,0,0,1,0,1,1,0,0,1,0,1), nrow=16)
dimnames(mat5)<- list(c("a", "c", "f", "h", "i", "j", "l", "m", "p", "q", "s", "t", "u", "v","x", "z"), 
                  c("5"))
mat6<- matrix(c(1,1,0,0,0,0,0,0,1,0,1,2,1,0,0,0), nrow=16)
dimnames(mat6)<- list(c("a", "c", "f", "h", "i", "j", "l", "m", "p", "q", "s", "t", "u", "v","x", "z"), 
                  c("6"))

Example output matrix for cluster size 2: Aggregates of 2 plots

mat7<- matrix( c(3,0,0,0,1,0,1,0,0,0,0,0,0,0,3,0,
             2,0,1,1,2,0,0,0,0,0,0,0,0,0,1,0,
             0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,
             0,1,0,0,1,1,0,1,0,1,1,0,0,1,0,1,
             1,1,0,0,1,0,0,1,1,1,2,2,1,1,0,1), nrow=16)
dimnames(mat7)<- list(c("a", "c", "f", "h", "i", "j", "l", "m", "p", "q", "s", "t", "u", "v","x", "z"), 
                  c("1_2", "2_3", "3_4", "4_5", "5_6"))

Example output matrix for cluster size 3: Aggregates of 3 plots

mat8<- matrix( c(3,0,1,1,2,0,1,0,0,0,0,0,0,0,3,0,
             2,1,1,1,2,1,0,0,0,0,0,0,0,0,1,0,
             0,1,1,1,2,1,0,1,0,1,1,0,0,1,0,1,
             1,2,0,0,1,1,0,1,1,1,2,2,1,1,0,1), nrow=16)
dimnames(mat8)<- list(c("a", "c", "f", "h", "i", "j", "l", "m", "p", "q", "s", "t", "u", "v","x", "z"), 
                  c("1_2_3", "2_3_4", "3_4_5", "4_5_6"))

Note that in each matrix, rows represent species and columns are "moving window" clusters of plot aggregates for said cluster size. I have named the column headings accordingly to indicate which plots are combined to achieve that cluster size. Ideally the loop would also indicate this information. Cells are the frequency of each species for a unique aggregate of n plots. Because cluster size limits the number of possible plot aggregations, the resulting matrices will vary in dimension lengths.

All matrices can be stored in a list. I primarily need help up to this step.

mat_list<- list(mat1, mat2, mat3, mat4, mat5, mat6, mat7, mat8 )
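As a rough sketch of how such a list might be built for every cluster size at once, the moving-window sums can be written as a matrix product with a 0/1 band matrix. The names band_aggregate and mat_by_size are hypothetical. Note one assumption that differs from mat_list above: the list here holds one matrix per cluster size, so the six single-plot matrices collapse into one 16 x 6 matrix for cluster size 1.

```r
species <- c("a", "c", "f", "h", "i", "j", "l", "m",
             "p", "q", "s", "t", "u", "v", "x", "z")

# Species-by-plot matrix copied from the example data
mat <- matrix(c(1,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,
                2,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,
                0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,
                0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
                0,0,0,0,1,0,0,1,0,1,1,0,0,1,0,1,
                1,1,0,0,0,0,0,0,1,0,1,2,1,0,0,0),
              nrow = 16, ncol = 6,
              dimnames = list(species, as.character(1:6)))

# Hypothetical helper: aggregate k adjacent plots via a band matrix.
band_aggregate <- function(m, k) {
  p <- ncol(m)
  starts <- seq_len(p - k + 1)
  # W[i, s] is 1 when plot i belongs to the window starting at plot s
  W <- outer(seq_len(p), starts,
             function(i, s) as.numeric(i >= s & i < s + k))
  out <- m %*% W
  colnames(out) <- vapply(starts,
                          function(s) paste(colnames(m)[s:(s + k - 1)], collapse = "_"),
                          character(1))
  out
}

# One matrix per cluster size, from 1 plot up to all plots combined
sizes <- seq_len(ncol(mat))
mat_by_size <- lapply(sizes, function(k) band_aggregate(mat, k))
names(mat_by_size) <- as.character(sizes)

print(mat_by_size[["3"]])
```

Indexing by name, for example mat_by_size[["2"]], then returns the matrix whose columns are 1_2 through 5_6.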

An extra step I would like to incorporate into a loop is to apply a series of functions to each matrix in the list. The result for each function can be added as a new column to the matrix. The functions I need to calculate for each cluster matrix are:

Calculate the frequency for each species among all aggregates (i.e., row totals).
Calculate the mean frequency for each species among all aggregates (i.e., row totals / number of aggregates).
Calculate the total area for each cluster size, here defined as cluster size * pi * 25.
Calculate the frequency per area: mean frequency / area.
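The four calculations above could be sketched as follows for a single cluster matrix. This uses the cluster size 3 example matrix (mat8) and base R only; the name summary3 is hypothetical, and appending the results with cbind() is just one assumed way of "adding columns to the matrix".

```r
species <- c("a", "c", "f", "h", "i", "j", "l", "m",
             "p", "q", "s", "t", "u", "v", "x", "z")

# Cluster size 3 matrix from the example (rows = species, columns = aggregates)
mat8 <- matrix(c(3,0,1,1,2,0,1,0,0,0,0,0,0,0,3,0,
                 2,1,1,1,2,1,0,0,0,0,0,0,0,0,1,0,
                 0,1,1,1,2,1,0,1,0,1,1,0,0,1,0,1,
                 1,2,0,0,1,1,0,1,1,1,2,2,1,1,0,1),
               nrow = 16,
               dimnames = list(species, c("1_2_3", "2_3_4", "3_4_5", "4_5_6")))

cluster_size <- 3

freq          <- rowSums(mat8)            # total frequency per species
mean_freq     <- rowMeans(mat8)           # row total / number of aggregates
area          <- cluster_size * pi * 25   # total area for this cluster size
freq_per_area <- mean_freq / area

# Append the results as new columns; the scalar area is recycled down the rows
summary3 <- cbind(mat8,
                  freq = freq,
                  mean_freq = mean_freq,
                  area = area,
                  freq_per_area = freq_per_area)

print(round(summary3, 4))
```

Because rowMeans() divides by the number of columns, the mean automatically uses the correct number of aggregates for each cluster size (4 here, 5 for cluster size 2).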

The output data frame for these three clusters will look like result_df:

#df for cluster size 1
result_df1<- data.frame(cluster_size= rep("1", 96), 
                    aggregate_ID= c(rep("1",16), rep("2", 16), rep("3", 16), rep("4", 16), rep("5", 16), rep("6",16)), 
                    species= rep(c("a", "c", "f", "h", "i", "j", "l", "m", "p", "q", "s", "t", "u", "v","x", "z"), 6), 
                    freq= c(1,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,
                    2,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,
                    0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,
                    0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
                    0,0,0,0,1,0,0,1,0,1,1,0,0,1,0,1,
                    1,1,0,0,0,0,0,0,1,0,1,2,1,0,0,0), 
mean_freq=c(1,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,
        2,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,
        0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,
        0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,1,0,0,1,0,1,1,0,0,1,0,1,
        1,1,0,0,0,0,0,0,1,0,1,2,1,0,0,0),
area= rep(78.54, 96))
result_df1$freq_per_area<- result_df1$mean_freq/78.54               

#df for cluster size 2                                          
result_df2<- data.frame( cluster_size= rep("2",80), 
                     aggregate_ID= c(rep("1_2",16), rep("2_3",16), rep("3_4",16), rep("4_5",16), rep("5_6",16)),
                     species= c("a", "c", "f", "h", "i", "j", "l", "m", "p", "q", "s", "t", "u", "v","x", "z"),
                     freq=c(6,3,2,2,6,2,1,2,1,2,3,2,1,2,4,2),
                     mean_freq=(c(6,3,2,2,6,2,1,2,1,2,3,2,1,2,4,2)/5), 
                     area= rep(157.08, 16))
result_df2$freq_per_area<- result_df2$mean_freq/157.08              

#df for cluster size 3
result_df3<- data.frame( cluster_size= rep("3",64), 
                     aggregate_ID= c(rep("1_2_3",16), rep("2_3_4",16), rep("3_4_5",16), rep("4_5_6",16)),
                     species= c("a", "c", "f", "h", "i", "j", "l", "m", "p", "q", "s", "t", "u", "v","x", "z"),
                     freq=c(6,4,3,3,7,3,1,2,1,2,3,2,1,2,4,2),
                     mean_freq=(c(6,4,3,3,7,3,1,2,1,2,3,2,1,2,4,2)/4), 
                     area= rep(235.62, 16))
 result_df3$freq_per_area<- result_df3$mean_freq/235.62

 result_df<- rbind(result_df1,result_df2,result_df3)

Note that result_df includes results for cluster sizes up to three, but for this example data frame cluster sizes could be as big as 6, so the loop would need to iterate up to the maximum cluster size.
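One possible shape for the full loop, sketched on a small stand-in matrix rather than the real data. It rests on an assumption about the desired layout: freq holds the count for each species within each aggregate (as in result_df1), while mean_freq is the mean across all aggregates of that cluster size (as in result_df2 and result_df3). All object names here are hypothetical.

```r
# Small stand-in for the species-by-plot matrix: 3 species by 4 plots
mat <- matrix(c(1, 0, 2,
                2, 1, 0,
                0, 1, 1,
                0, 0, 3),
              nrow = 3,
              dimnames = list(c("a", "c", "f"), as.character(1:4)))

results <- vector("list", ncol(mat))

for (k in seq_len(ncol(mat))) {            # cluster sizes 1 .. number of plots
  starts <- seq_len(ncol(mat) - k + 1)     # first plot of each moving window

  # Species totals for each window of k adjacent plots
  agg <- vapply(starts,
                function(s) rowSums(mat[, s:(s + k - 1), drop = FALSE]),
                numeric(nrow(mat)))
  ids <- vapply(starts,
                function(s) paste(colnames(mat)[s:(s + k - 1)], collapse = "_"),
                character(1))

  area <- k * pi * 25

  # Long format: one row per species per aggregate
  results[[k]] <- data.frame(
    cluster_size = k,
    aggregate_ID = rep(ids, each = nrow(mat)),
    species      = rep(rownames(mat), times = length(ids)),
    freq         = as.vector(agg),                          # column-wise order
    mean_freq    = rep(rowMeans(agg), times = length(ids)),
    area         = area,
    stringsAsFactors = FALSE
  )
  results[[k]]$freq_per_area <- results[[k]]$mean_freq / area
}

result_df <- do.call(rbind, results)
print(head(result_df))
```

Swapping the stand-in for the real 16 x 6 matrix should give cluster sizes 1 through 6 without other changes, since the loop bounds come from ncol(mat).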

Hi @Danielle, do you think you can reduce your question to something very specific which can be demonstrated on the minimal required object(s)? — Bulat, Sep 19 '19 at 22:48
@Bulat the first step I need the most help with is creating the list of matrices for each cluster size. How to aggregate plots iteratively is the most challenging step. I can figure out how to run the various functions once I have that list, so if it helps you to omit the last part of the post where I request running functions on those matrices within the list then please omit that. — Danielle, Sep 19 '19 at 23:56
I have made a few edits to my OP to clarify what specifically I primarily need help with — Danielle, Sep 20 '19 at 00:21
Here is a link to a more specific question relevant to the first part: https://stackoverflow.com/q/58021216/8061255 — Danielle, Sep 20 '19 at 03:00
I am glad you got the answer for the more specific question already. Have a look here https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example to further improve this one. — Bulat, Sep 20 '19 at 06:41
