2

I have a dataset in which I am measuring area of between 10-40 microglia for some 25ish subjects, each measured in 3 different slices of tissue. I want to do hierarchical clustering to ask if the area differs between VEH M, VEH F, MOR M, and MOR F. I have attempted to arrange my data as follows:

https://docs.google.com/spreadsheets/d/1oueD6y6u4Jm4cYg-OKqau_GMQtGxmnKW/edit?usp=sharing&ouid=117418626760035323883&rtpof=true&sd=true

so that I can use it with the ClusterBootstrap package.

I ran the following code:

clusbootglm(Area ~ Microglia*Treatment, data = area_test, clusterid = Animal_Slice, family = gaussian, B = 5000, confint.level = 0.95, n.cores = 1)

But that returns coefficients for the model for each individual microglia, which seems incorrect. I've been trying to follow the instructions from their paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7148287/) and the documentation (https://cran.r-project.org/web/packages/ClusterBootstrap/ClusterBootstrap.pdf) but I don't really understand how to represent the data correctly. I also tried to use the pvclust package, but the output didn't seem to answer the questions I had (how is area different between the treatment groups). Any help would be appreciated.

  • Clustering? It is ot typically used to understand the relationship or difference between variables as you would in a regression model. This seems more like a task for statistical tests such as ANOVA or t-tests, rather than clustering, no? – VonC Jun 15 '23 at 19:47
  • Normally, yes, but it is incorrect to use ANOVA on this type of data because having multiple values for each animal violates the assumption of independence – Hannah Harder Jun 16 '23 at 15:16

1 Answers1

0

Reading "ClusterBootstrap: An R package for the analysis of hierarchical data using generalized linear models with the cluster bootstrap" (from Mathijs Deen & Mark de Rooij), the clusbootglm function from the ClusterBootstrap package is used to fit generalized linear models with the cluster bootstrap for analysis of clustered data.
The function is designed to handle data where observations are not independent, but are clustered or grouped in some way.

In your case, the clusters are defined by the Animal_Slice variable.

The formula you are using in clusbootglm (Area ~ Microglia*Treatment) is specifying a model where the area is predicted by both the microglia and the treatment, and their interaction. This is why you are getting coefficients for each individual microglia.

If you want to test if the area differs between the treatment groups (VEH M, VEH F, MOR M, and MOR F), you should specify these groups in your model. Assuming these groups are represented in your Treatment variable, your formula could be Area ~ Treatment. This will give you coefficients comparing each treatment group to the reference group (the first level of the Treatment variable).

However, if your Treatment variable does not represent these groups, you will need to create a new variable that does. This could be done with the mutate function from the dplyr package, for example:

area_test <- area_test %>%
  mutate(Treatment_Group = case_when(
    Treatment == "VEH" & Sex == "M" ~ "VEH M",
    Treatment == "VEH" & Sex == "F" ~ "VEH F",
    Treatment == "MOR" & Sex == "M" ~ "MOR M",
    Treatment == "MOR" & Sex == "F" ~ "MOR F"
  ))

Then you could use this new variable in your model: Area ~ Treatment_Group.

Remember, the clusbootglm function is not performing clustering in the sense of hierarchical or k-means clustering. It is fitting a generalized linear model, taking into account the clustered nature of your data. The "cluster" in clusbootglm refers to the clusters of observations, not to clusters found in a clustering algorithm.

How is area different between the treatment groups?

You would need to fit a model that predicts Area based on the Treatment variable (or Treatment_Group as seen above)). This can be done using the clusbootglm function, as you have been doing. However, you should adjust your formula to Area ~ Treatment to focus on the treatment groups.

clusbootglm(Area ~ Treatment_Group, data = area_test, clusterid = Animal_Slice, family = gaussian, B = 5000, confint.level = 0.95, n.cores = 1)

That is fitting a generalized linear model where the response variable is Area and the predictor is Treatment_Group.
The data is coming from the area_test dataframe and the clusters are identified by the Animal_Slice variable.
The error distribution used in the model is Gaussian. The number of bootstrap samples (B) is set to 5000.
The confidence interval level (confint.level) is set to 0.95 (or 95%), and the number of CPU cores (n.cores) used for the computation is 1.

This will return a set of coefficients, one for each level of the Treatment_Group variable (except the reference level). These coefficients represent the difference in the predicted Area between the corresponding treatment group and the reference group.

If the coefficient for a treatment group is positive, it means that the Area is larger for that group compared to the reference group. If the coefficient is negative, it means the Area is smaller. The magnitude of the coefficient indicates the size of the difference.

To interpret these results, you will need to know which level of the Treatment_Group variable is the reference level. This is typically the first level in alphabetical order, but it can be changed using the relevel function.

Remember, these results are based on the model you have specified and the data you have. They do not prove that the treatment causes a difference in Area, only that there is an association between treatment and Area in your data. Other factors could also be influencing Area.

Also, keep in mind that the clusbootglm function is performing a bootstrap procedure, which involves random sampling with replacement. This means that the results can vary slightly each time you run the function. To get consistent results, you can set a seed before running the function using the set.seed function, as explained in "Reasons for using the set.seed function".


So you might need:

  • Data preparation: for instance, creating a new variable Treatment_Group using the mutate() function to combine the information about treatment and sex. This new variable will be used in the model instead of separate treatment and sex variables.
area_test <- area_test %>%
  mutate(Treatment_Group = case_when(
    Treatment == "VEH" & Sex == "M" ~ "VEH M",
    Treatment == "VEH" & Sex == "F" ~ "VEH F",
    Treatment == "MOR" & Sex == "M" ~ "MOR M",
    Treatment == "MOR" & Sex == "F" ~ "MOR F"
  ))
  • Model specification: Instead of using hierarchical clustering, I suggested fitting a generalized linear model with the cluster bootstrap using the clusbootglm function. The model will have Area as the dependent variable and Treatment_Group as the independent variable. The clusterid argument is set to Animal_Slice to account for the clustered nature of the data.
clusbootglm_result <- clusbootglm(Area ~ Treatment_Group, data = area_test, clusterid = Animal_Slice, R = 1000)

(default error distribution. The number of bootstrap samples is set to 1000 (specified by R). But you can also reuse the parameters mentioned above)

  • Model interpretation: After fitting the model, you can interpret the results to understand the differences in microglia area between the treatment groups. You can use the summary() function to obtain the coefficients and p-values for the model.
summary(clusbootglm_result)

You can then analyze your data using clusbootglm to account for the clustered nature of the data and investigate the differences in microglia area between the treatment groups.

VonC
  • 1,262,500
  • 529
  • 4,410
  • 5,250
  • Thank you so so much for this answer. It was extremely helpful!! Can I reach out to you if I have any follow up questions? I usually have a pretty good grasp of statistics and coding but something about this data structure and package is throwing me off. I have given you the bounty as well!! – Hannah Harder Jun 17 '23 at 20:48
  • @HannahHarder Sure, ask new questions here (on Stack Overflow), and ping me: that way, others can help, and everybody can learn (including me). – VonC Jun 18 '23 at 04:55