2

I'm currently trying to apply the resampling method described in Herrmann et al. (2022) with an outer and an inner resampling loop : enter image description here My dataset is composed of 100 rows by 60 columns. Rows correspond to tries ID (one try per row) and columns to species (one species per column). Each cell represents the count of individuals in each species i for each try K = NiK

NiK could vary with resampling, but the overall amount of individuals in a given try all species confounded (called NK, corresponding to the row sums) need to be constant.

Let's generate a similar data frame :

df <- matrix(sample(0:15, 6000, replace = T), ncol = 60)
df

Here is the code I use for the outer loop :

#Outer loop
#Resampling among the 100 tries 1000 times
seed(1234)
outer = matrix(replicate(1000, sample(1:nrow(df), nrow(df), replace = T)), ncol = 100)

And now adding the inner loop, where i'm stuck. Inner loop resamples the number of individuals for a given species within each try, while outer loop resamples tries among the new dataset generated by the inner loop :

i = species number

r = outer resampling ID

inner <- NULL
complete <- NULL
for (r in 1:1000) {
  for (i in 1:ncol(df)) {
    inner = cbind(inner,sample(df[,i], nrow(df), replace = T))
  }
  complete = rbind(complete, inner[outer[r,],])
  inner <- NULL
}
complete

The loop work fine to generate 1000 resampling of 100 tries through the inner and outer loops, but I have currently no idea of how to constrain each row to sum to NK, which could be calculated for a given try K :

rowSums(df[K,])

Does anyone know a function or a method that could permit to maintain NK constant while resampling in each column with the inner loop ?

Thanks by advance for your help, and hope my english and my explanations are understandable.

2 Answers2

1

Using rowSums(df) and rmultinom, we can resample a row and end up with the same total. Demonstrating:

rs <- rowSums(df)
all(replicate(100, sum(rmultinom(1, rs[1], df[1,])) == rs[1]))
#> [1] TRUE

Since the inner samples are to be summed, for each outer sample, we can row sum multiple row resamples.

set.seed(1234)
K <- 100L # "tries"
N <- 60L # species
R <- 1e3L # resamples
df <- matrix(sample(0:15, N*K, replace = 1), K, N)
rs <- rowSums(df)

outer <- mapply( # outer sampling
  \(x) rowSums(
    # inner sampling
    mapply(\(i) rmultinom(1, rs[i], df[i,]), sample(nrow(df), K, 1))
  ),
  1:R
)

dim(outer)
#> [1]   60 1000

Each column of outer now corresponds to an outer resample.

jblood94
  • 10,340
  • 1
  • 10
  • 15
0

Thank you jblood94 for your quick answer ! Maybe I misunderstood your answer, but my aim is to resample columns with constant row sums, and not resample row with constant row sums.

I finally find a solution to my problem. I was solving the problem with the wrong dataframe.

Indeed, my original dataset look more like :

ID <- sort(rep(1:100,times=c(sample(1:10, 100, replace = T))))
df <- data.frame(ID, Species.name= sample(letters[1:26], length(ID), replace = T))
df

with ID = Try ID and Species.name the identification of the species.

using acast() I obtained a similar dataframe as described in my original post :

 library(reshape2)
 acast(df, ID~Species.name)

suitable to calculate diversity index.

Nonetheless, with df, it is easier to perform a nested sampling while keeping the number of individuals per try, since an individual is equal to a row !

The working code :

 library(reshape2)
   library(dplyr)
#Get species complete list
sptot <- unique(df$Species.name)
length(sptot)

#Get try ID
nID <- unique((df$ID))
length(nID)

#Extract number of individuals per try to keep it constant
groupsize = df %>%
  group_by(ID)%>%
  summarise(n = n())
groupsize

#Test suitability of the inner resampling
outer = data.frame(ID = as.integer(sample(nID, length(nID), replace = T)), Ntry = 
seq(1,length(nID),1))
outerdata = left_join(outer, df, by = "ID", relationship = "many-to- 
many")

df2 = outerdata %>%
  group_by(Ntry, ID)%>%
  summarise(n = n())%>%
  left_join(groupsize, by = "ID")%>%
  mutate(diff = n.x - n.y)
mean(df2$diff) #Same number of row (individuals) per try

df3 = outerdata %>%  
  left_join(groupsize, by = "ID") %>%
  group_by(Ntry) %>%
  mutate(Sp = sample(Species.name, replace = T)) %>%
  filter(length(Sp) == n) %>%
  ungroup()
df3

check = df3 %>%
  group_by(ID, Ntry)%>%
  summarise(nfacor = length(unique(Species.name)),
            nfacsamp = length(unique(Sp)))

check$V = ifelse(check$nfacsamp <= check$nfacor, "OK", "FALSE")
check #no more unique species in resampling than in observed


#Create the loop
inner <- NULL
outer <- NULL
complete <- NULL
for (i in 1:1000) {
  #Outer resampling
  outer = data.frame(ID = as.integer(sample(nID, length(nID), replace = T), Ntry = seq(1,length(nID),1))
  outerdata = left_join(outer, df, by = "ID", relationship = "many-to-many")
  
  
  #Inner resampling
  inner = outerdata %>%  
    left_join(groupsize, by = "ID") %>%
    group_by(Ntry) %>%
    mutate(Sp = sample(Species.name, replace = T)) %>%
    filter(length(Sp) == n) %>%
    ungroup()
  inner = inner[,c("Ntry", "Sp")]
  
  #Adding missing column (i.e missing species since it is possible that 
  #some species are not resampled in a given iteration)
  sampcol <- colnames(acast(inner, Ntry ~ Sp, fun.aggregate = length))
  misscol = sub(" ", "\\.", setdiff(as.character(sptot), sampcol))
  complete = data.frame(Ntry= rownames(acast(inner, Ntry ~ Sp, fun.aggregate = length)), acast(inner, Ntry ~ Sp, fun.aggregate = length), bloc = i)
  complete[,misscol] <- 0
}
complete

Then we obtain a 1000 x 100 rows dataframe, with number of columns corresponding to species, ideal to calculate diversity index. The annoying thing is the error message from acast() but it does not impact the result.

Following Herrmann et al. (2022), we now need to : 1) sum columns per block in order to obtain 1000 rows, each pooling count of species within resampled tries and 2) then calculated 1000 values for each diversity index, and get Efron 95% CI.

Hope this thread will help somebody to perform similar analysis !