Reading "ClusterBootstrap: An R package for the analysis of hierarchical data using generalized linear models with the cluster bootstrap" (from Mathijs Deen & Mark de Rooij), the clusbootglm
function from the ClusterBootstrap package is used to fit generalized linear models with the cluster bootstrap for analysis of clustered data.
The function is designed to handle data where observations are not independent, but are clustered or grouped in some way.
In your case, the clusters are defined by the Animal_Slice
variable.
The formula you are using in clusbootglm
(Area ~ Microglia*Treatment
) is specifying a model where the area is predicted by both the microglia and the treatment, and their interaction. This is why you are getting coefficients for each individual microglia.
If you want to test if the area differs between the treatment groups (VEH M, VEH F, MOR M, and MOR F), you should specify these groups in your model. Assuming these groups are represented in your Treatment
variable, your formula could be Area ~ Treatment
. This will give you coefficients comparing each treatment group to the reference group (the first level of the Treatment
variable).
However, if your Treatment
variable does not represent these groups, you will need to create a new variable that does. This could be done with the mutate
function from the dplyr package, for example:
area_test <- area_test %>%
mutate(Treatment_Group = case_when(
Treatment == "VEH" & Sex == "M" ~ "VEH M",
Treatment == "VEH" & Sex == "F" ~ "VEH F",
Treatment == "MOR" & Sex == "M" ~ "MOR M",
Treatment == "MOR" & Sex == "F" ~ "MOR F"
))
Then you could use this new variable in your model: Area ~ Treatment_Group
.
Remember, the clusbootglm
function is not performing clustering in the sense of hierarchical or k-means clustering. It is fitting a generalized linear model, taking into account the clustered nature of your data. The "cluster" in clusbootglm
refers to the clusters of observations, not to clusters found in a clustering algorithm.
How is area different between the treatment groups?
You would need to fit a model that predicts Area
based on the Treatment
variable (or Treatment_Group
as seen above)). This can be done using the clusbootglm
function, as you have been doing. However, you should adjust your formula to Area ~ Treatment
to focus on the treatment groups.
clusbootglm(Area ~ Treatment_Group, data = area_test, clusterid = Animal_Slice, family = gaussian, B = 5000, confint.level = 0.95, n.cores = 1)
That is fitting a generalized linear model where the response variable is Area
and the predictor is Treatment_Group
.
The data is coming from the area_test
dataframe and the clusters are identified by the Animal_Slice
variable.
The error distribution used in the model is Gaussian. The number of bootstrap samples (B
) is set to 5000.
The confidence interval level (confint.level
) is set to 0.95 (or 95%), and the number of CPU cores (n.cores
) used for the computation is 1.
This will return a set of coefficients, one for each level of the Treatment_Group
variable (except the reference level). These coefficients represent the difference in the predicted Area
between the corresponding treatment group and the reference group.
If the coefficient for a treatment group is positive, it means that the Area
is larger for that group compared to the reference group. If the coefficient is negative, it means the Area
is smaller. The magnitude of the coefficient indicates the size of the difference.
To interpret these results, you will need to know which level of the Treatment_Group
variable is the reference level. This is typically the first level in alphabetical order, but it can be changed using the relevel
function.
Remember, these results are based on the model you have specified and the data you have. They do not prove that the treatment causes a difference in Area
, only that there is an association between treatment and Area
in your data. Other factors could also be influencing Area
.
Also, keep in mind that the clusbootglm
function is performing a bootstrap procedure, which involves random sampling with replacement. This means that the results can vary slightly each time you run the function. To get consistent results, you can set a seed before running the function using the set.seed
function, as explained in "Reasons for using the set.seed
function".
So you might need:
- Data preparation: for instance, creating a new variable
Treatment_Group
using the mutate()
function to combine the information about treatment and sex. This new variable will be used in the model instead of separate treatment and sex variables.
area_test <- area_test %>%
mutate(Treatment_Group = case_when(
Treatment == "VEH" & Sex == "M" ~ "VEH M",
Treatment == "VEH" & Sex == "F" ~ "VEH F",
Treatment == "MOR" & Sex == "M" ~ "MOR M",
Treatment == "MOR" & Sex == "F" ~ "MOR F"
))
- Model specification: Instead of using hierarchical clustering, I suggested fitting a generalized linear model with the cluster bootstrap using the
clusbootglm
function. The model will have Area
as the dependent variable and Treatment_Group
as the independent variable. The clusterid
argument is set to Animal_Slice
to account for the clustered nature of the data.
clusbootglm_result <- clusbootglm(Area ~ Treatment_Group, data = area_test, clusterid = Animal_Slice, R = 1000)
(default error distribution. The number of bootstrap samples is set to 1000 (specified by R
). But you can also reuse the parameters mentioned above)
- Model interpretation: After fitting the model, you can interpret the results to understand the differences in microglia area between the treatment groups. You can use the
summary()
function to obtain the coefficients and p-values for the model.
summary(clusbootglm_result)
You can then analyze your data using clusbootglm
to account for the clustered nature of the data and investigate the differences in microglia area between the treatment groups.