
How can I specify a more complex data structure than a simple ID column?

If I have a glmertree model, how can I specify (e.g.) a cross classified model in the cluster covariance tests?

tree_1 <- 
  glmertree(
    data = sim_dat, 
    formula = 
      performance ~ 1 + predictors | 
      (1 | student_id) + (1 | question_number) | 
      partitioning_variables, 
    family = 'binomial',
    cluster = ???
  )

Or how about in a simple nested design?

tree_2 <- 
  lmertree(
    data = sim_dat, 
    formula = 
      test_score ~ 1 + predictors | 
      (1 | district/school) | 
      ## equivalent to (1 | district) + (1 | district:school)
      partitioning_variables, 
    cluster = ???
  )

So far, I've fit models with cluster covariance tests on whatever level has the greatest variance in the outcome, but fitting the proper structure seems more appropriate if possible.

Thanks!

Chrr1s
  • This doesn't appear to be a specific programming question that's appropriate for Stack Overflow. If you have general questions about the appropriate use of various statistical methods, then you should ask such questions over at [stats.se] instead. You are more likely to get better answers there. – MrFlick Sep 29 '21 at 19:25
  • @MrFlick, my apologies. Maybe I'm being unclear; I'm not asking if it's possible, I'm asking how to specify it in the packages. – Chrr1s Sep 29 '21 at 19:28
  • So you already know it's possible? If so, it might be nice to link to that documentation. It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Sep 29 '21 at 19:33
  • @MrFlick, I'm not sure how I would produce the desired output in this case, which is why I'm asking. I'm pretty new to asking for help online. Here is a paper that does something akin to what I'm trying to do (https://cran.r-project.org/web/packages/pcse/vignettes/pcse.pdf), but their adjustments are of standard errors after lm(). I'm looking to apply this within {glmertree} (or {partykit}, etc). – Chrr1s Sep 29 '21 at 19:44
  • This is from the same author of partykit, so probably even more useful for documentation of these complex cluster covariance models being implemented in one version or another https://cran.r-project.org/web/packages/sandwich/vignettes/sandwich-CL.pdf. Regardless, I can try putting this in cross validated if you still think that's a better place – Chrr1s Sep 29 '21 at 19:53
  • @Chrr1s Can you provide some more information about at what level the partitioning variables are measured? In tree_1, at the student and/or the question level? In tree_2, at the school or district level? – Marjolein Fokkema Oct 07 '21 at 23:50
  • @MarjoleinFokkema, thank you for your reply! In both cases, the partitioning variables are at both levels. So for tree_1, we'd like to include variables about the question (e.g., a categorical "problem type" variable, or a dichotomous variable indicating whether the student had seen the question before on a practice test). Additionally, we have student demographics and some biological measurements (e.g., first year vs. third year college student and skin conductance). – Chrr1s Oct 08 '21 at 18:31
  • Similarly with tree_2, we have information about both schools and districts. A school can only be in one district. For example, we have school-level covariates (e.g., amount of funding a school receives) and district-level (e.g., district policy variables). I had to split these so they would fit in a comment, apologies. – Chrr1s Oct 08 '21 at 18:32

1 Answer


I hope I understand your question correctly; as per my comment to your question above, some more info might be helpful. This is a preliminary answer:

The cluster argument should be specified so that the parameter stability tests are performed at the level at which the partitioning variables are measured. In most (but not all) cases, I would expect this to be a single level, so only a single clustering variable needs to be passed to the cluster argument.

In tree_1, if all partitioning variables are measured on the same level (i.e., all are characteristics of either the students or the questions), then you specify cluster = student_id or cluster = question_number, respectively. If some partitioning variables are measured on the student level and others on the question level, it's going to be more complex, and we will have to look into that (I am one of the package authors).
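As a rough illustration of the single-level case for tree_1, here is a minimal sketch on simulated data. Everything about the data is an assumption made up for this example (the sample sizes, the variables year and skin_conductance, and the effect sizes); only student_id, question_number, and performance follow the names used in the question. Both partitioning variables are student characteristics, so the student identifier is passed to cluster:

```r
library(glmertree)

set.seed(42)
n_students  <- 60   # hypothetical sample sizes
n_questions <- 10

# Fully crossed design: every student answers every question
sim_dat <- expand.grid(
  student_id      = factor(seq_len(n_students)),
  question_number = factor(seq_len(n_questions))
)
s_idx <- as.integer(sim_dat$student_id)
q_idx <- as.integer(sim_dat$question_number)

# Student-level partitioning variables (constant within each student)
year <- sample(c("first", "third"), n_students, replace = TRUE)
skin <- rnorm(n_students)
sim_dat$year             <- factor(year[s_idx])
sim_dat$skin_conductance <- skin[s_idx]

# Crossed random intercepts for students and questions
u_student  <- rnorm(n_students,  sd = 0.5)
u_question <- rnorm(n_questions, sd = 0.5)
eta <- ifelse(sim_dat$year == "third", 1, -0.5) + u_student[s_idx] + u_question[q_idx]
sim_dat$performance <- rbinom(nrow(sim_dat), size = 1, prob = plogis(eta))

# All partitioning variables are student characteristics,
# so the stability tests are clustered at the student level
tree_1 <- glmertree(
  performance ~ 1 |
    (1 | student_id) + (1 | question_number) |
    year + skin_conductance,
  data    = sim_dat,
  family  = binomial,
  cluster = student_id
)
```

If all partitioning variables were question characteristics instead, the same call would use cluster = question_number. The mixed case, with partitioning variables at both levels, is not covered by this sketch.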

In tree_2, I assume that a single school can only be part of a single district. If all partitioning variables are measured on the district level, you specify cluster = district. If all partitioning variables are measured on the school level, then make sure that the school variable has a unique identifier for each school (i.e., school IDs are not reused across districts), and specify cluster = school. If a single school can be part of multiple districts, and partitioning variables are measured at both district and school level, then we will have to look into that.
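For the nested design, a minimal sketch on simulated data might look as follows. The design sizes, the variables funding and predictor, and the effect sizes are all hypothetical and chosen only for illustration. The sketch builds a school identifier that is unique across districts, and because the only partitioning variable is a school characteristic, it passes school to cluster:

```r
library(glmertree)

set.seed(123)
# Hypothetical design: 8 districts, 5 schools per district, 20 students per school
sim_dat <- expand.grid(
  student                = seq_len(20),
  school_within_district = seq_len(5),
  district               = seq_len(8)
)
sim_dat$district <- factor(sim_dat$district)

# Unique school identifier: school labels are not reused across districts
sim_dat$school <- interaction(
  sim_dat$district, sim_dat$school_within_district, drop = TRUE
)
d_idx <- as.integer(sim_dat$district)
s_idx <- as.integer(sim_dat$school)

# School-level partitioning variable (constant within each school)
funding <- runif(nlevels(sim_dat$school))
sim_dat$funding <- funding[s_idx]

# Student-level predictor whose slope depends on school funding
sim_dat$predictor <- rnorm(nrow(sim_dat))
u_district <- rnorm(nlevels(sim_dat$district))
u_school   <- rnorm(nlevels(sim_dat$school), sd = 0.5)
sim_dat$test_score <-
  ifelse(sim_dat$funding > 0.5, 2, 0) * sim_dat$predictor +
  u_district[d_idx] + u_school[s_idx] + rnorm(nrow(sim_dat))

# The partitioning variable is a school characteristic,
# so the stability tests are clustered at the school level
tree_2 <- lmertree(
  test_score ~ predictor |
    (1 | district/school) |
    funding,
  data    = sim_dat,
  cluster = school
)
```

With district-level partitioning variables only, the same call would use cluster = district instead.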