3

I'm having issues with knitr. Specifically, I have a model which runs absolutely fine in the console but when I try and knit the document, R throws an error.

Load the dataset (available here to facilitate replication )

scabies <- read.csv(file = "S1-Dataset_CSV.csv", header = TRUE, sep = ",")
scabies$agegroups <- as.factor(cut(scabies$age, c(0,10,20,Inf), labels = c("0-10","11-20","21+"), include.lowest = TRUE)) 
scabies$agegroups <-relevel(scabies$agegroups, ref = "21+")
scabies$house_cat <- as.factor(cut(scabies$house_inhabitants, c(0,5,10,Inf), labels = c("0-5","6-10","10+"), include.lowest = TRUE))
scabies$house_cat <- relevel(scabies$house_cat, ref = "0-5")
scabies <- scabies %>% mutate(scabies = case_when(scabies_infestation=="yes"~1,
                                                  scabies_infestation=="no"~0)) %>%
                      mutate(impetigo = case_when(impetigo_active=="yes" ~1,
                                                  impetigo_active=="no" ~0))

fit the model

scabiesrisk <- glm(scabies~agegroups+gender+house_cat,data=scabies,family=binomial())
scabiesrisk_OR <- exp(cbind(OR= coef(scabiesrisk), confint(scabiesrisk)))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_summary <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_summary

This code runs absolutely fine in the Console. But when I try knitr I get:

Error in model.frame.default(formula = scabies ~ agegroups + gender + : invalid type(list) for variable 'scabies Calls: ... glm -> eval -> eval -> -> model.frame.default

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
mmarks
  • 1,119
  • 7
  • 14
  • 1
    Have yo checked when you `knitr` that the `scabies` dataset just before applying the `glm` is right? – iago Aug 27 '20 at 12:58
  • 2
    I can `knitr` your code without any problem... – iago Aug 27 '20 at 13:04
  • Hi Iago - what do you mean by your first comment? – mmarks Aug 28 '20 at 13:03
  • If the `scabies` dataset is what it has to be and not a wrong, for example empty, dataset, after running all the commands previous to `glm`? – iago Aug 28 '20 at 13:12
  • The script/markdown is as above. The dataset is clearly correct as it runs in console but wont knit in Markdown – mmarks Aug 29 '20 at 07:48
  • I can't replicate this. I put your code in a `.Rmd` file (with a `library(tidyverse)` at the top) and run `rmarkdown::render()` on it, it works fine. So I can't help any further ... – Ben Bolker Sep 14 '20 at 20:35
  • Thanks @BenBolker so strange! – mmarks Sep 15 '20 at 21:03

2 Answers2

5

I was able to reproduce the problem you describe, but haven't yet fully understood what happens under the hood.
This Markdown chunck is interesting :

```{r}
scabiesrisk_OR <- exp(cbind(OR= coef(scabiesrisk), confint((scabiesrisk))))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_summary <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_summary
```

If I manually quickly execute the lines in the chunck one after another (ctrl+Enter x 4), sometimes I get two profiling messages:

Waiting for profiling to be done...
Waiting for profiling to be done...

In this case, summary(scabiesrisk) is a matrix:

> class(scabiesrisk_summary)
[1] "matrix" "array" 

If I manually slowly execute the lines in the chunk, I get only one profiling message:

Waiting for profiling to be done...

summary(scabiesrisk) is a summary.glm :

> class(scabiesrisk_summary)
[1] "summary.glm"

Looks like profiling is launched on a separate thread, and depending on whether it was finished or not, summary function doesn't have the same behaviour. If profiling is finished, it returns the expected summary.glm object, but if it isn't the case it launches another profiling and returns a matrix.
In particular, with a matrix scabiesrisk_summary$coefficients isn't available and I get in this situation the following error message:

Error in scabiesrisk_summary$coefficients : 
  $ operator is invalid for atomic vectors

enter image description here

This could possibly also happen while knitting : does knitting overhead make profiling slower so that the problem occurs?

With the workaround found here (use confint.defaultinstead of confint), I wasn't able to reproduce the above problem:

scabiesrisk_OR <- exp(cbind(OR= coef(scabiesrisk), confint.default((scabiesrisk))))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_summary <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_summary
                       OR      2.5 %    97.5 %   Estimate Std. Error
(Intercept)    0.09357141 0.06984512 0.1253575 -2.3690303  0.1492092
agegroups0-10  2.20016940 1.60953741 3.0075383  0.7885344  0.1594864
agegroups11-20 2.53291768 1.79985894 3.5645415  0.9293719  0.1743214
gendermale     1.44749159 1.13922803 1.8391682  0.3698321  0.1221866
house_cat6-10  1.30521927 1.02586104 1.6606512  0.2663710  0.1228792
house_cat10+   1.17003712 0.67405594 2.0309692  0.1570355  0.2813713
                   z value     Pr(>|z|)
(Intercept)    -15.8772359 9.110557e-57
agegroups0-10    4.9442116 7.645264e-07
agegroups11-20   5.3313714 9.747386e-08
gendermale       3.0267824 2.471718e-03
house_cat6-10    2.1677478 3.017788e-02
house_cat10+     0.5581076 5.767709e-01

So you could also probably try this in your case.

Contrary to confint.defaut which is a directly readable R function, confint is a S3 dispatch method (thanks @Ben Bolker for the internal references in comments), and I didn't yet investigate further what could explain this surprising behaviour.

Another option seems to save scabiesrisk_summary in another variable.
I tried hard but was never able to reproduce the problem after doing so :

```{r}
scabiesrisk_OR <- exp(cbind(OR= coef(scabiesrisk), confint((scabiesrisk))))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_final <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_final
```
Waldi
  • 39,242
  • 6
  • 30
  • 78
  • 1
    This is all pretty surprising. `confint` is an S3 generic dispatch method. `stats:::confint.glm` is a weird little stub that retrieves the `MASS:::confint.glm` function and runs it. `confint.glm` actually produces a `profile.glm` object and calls `MASS:::confint.profile.glm`. I can't reproduce the behaviour you describe here ... – Ben Bolker Sep 07 '20 at 20:35
  • @Ben Bolker, thank for trying and for the reference to the underlying functions. This probably depends on the hardware you're on, the speed you type 4 x Ctrl + enter in the chunck. I don't succeed every time to reproduce the problem, but when I try a few times it always appears (R4.0.2 / Win 10). – Waldi Sep 07 '20 at 20:52
  • @Ben Bolker, if instead of assigning `cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)` to `scabiesrisk_summary` I assign it to another variable name, I'm unable to reproduce the problem. Not sure what happens here but looks like shared memory being written while read. – Waldi Sep 07 '20 at 21:25
  • this is still moderately surprising, as I don't know of anything in the underlying code that would be thread-unsafe ... I do believe you, I just can't reproduce it. – Ben Bolker Sep 07 '20 at 21:29
  • @mmarks, interested to know whether one of the two above workarounds make knit work in your case. – Waldi Sep 08 '20 at 19:18
  • This doesnt fix it - certainly when I change to scabiesrisk_final i still get a knitr issue. I think as per @BenBolker below it is something to do with tidyverse but I'm confused why it happens at line 840 in my overall script when I have called tidyverse multiple times earlier in the script. – mmarks Sep 14 '20 at 19:28
  • Thanks for the feedback, sorry it didn't help, not an easy problem. Hopefully you'll find out. – Waldi Sep 14 '20 at 20:51
3

I strongly suspect that you forgot to include library(tidyverse) in your script. If tidyverse is loaded, then your code works fine. If it's not:

  • the step where you try to mutate() (and use %>%) fails, so the scabies variable is never created within the scabies data set
  • glm(scabies ~ ...) then interprets the response variable scabies as being the whole data set, and complains that the response variable is "invalid type(list)".

For this reason it's good practice to avoid having variables within data frames that have the same name as the data frames themselves ...

Your data transformation steps can be cleaned up a little bit (as.factor() is redundant; you can do all of the transformations as steps within a single mutate() call; as.numeric(x=="yes") is a shorter way to turn a string into a 0/1 variable ...) If I were going to do a lot more of this I would write a custom mycut() function that took breakpoints and a desired reference level as input arguments, constructed custom labels, and did the releveling.

library(tidyverse)
scabies <- (read.csv(file = "S1-Dataset_CSV.csv") %>%
            mutate(agegroups <- cut(age, c(0,10,20,Inf),
                                    labels = c("0-10","11-20","21+"),
                                    include.lowest = TRUE),
                   agegroups = relevel(agegroups, ref = "21+"),
                   house_cat = cut(house_inhabitants, c(0,5,10,Inf),
                                   labels = c("0-5","6-10","10+"),
                                   include.lowest = TRUE),
                   house_cat = relevel(house_cat, ref = "0-5"),
                   scabies = as.numeric(scabies_infestation=="yes"),
                   impetigo = as.numeric(impetigo_active=="yes"))
)
Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • I call library(tidyverse) right at the beginning of the script and dont think this is the cause because several earlier chunks within the markdown script have tidyverse dependent commands which run and then it only breaks here (at what in the overall script is line 839 of the overall script) The overall markdown (which is a rough guide of my own) is here: https://www.dropbox.com/s/1g7p159ns5267yy/R%20for%20Clinical%20Epi.Rmd?dl=0 If the issue was tidyverse then I would expect several of the earlier chunks to fail – mmarks Sep 08 '20 at 20:32
  • Actually I think this is the issue but I still dont full understand it. I changed the name of the variable to scabies_present and now i get object scabies_present not found. But them i am confused why the earlier tidyverse command/chunks in the markdown do run and then this one fails? – mmarks Sep 08 '20 at 20:41