
I wanted to look at contrasts between the nested terms of the following model:

is_service ~ action_count * document_entropy

The full dataset is loaded in the code at the end of the question.

The data look like this:

> str(dat)
'data.frame':   6432 obs. of  3 variables:
 $ action_count    : num  0.0759 0.1505 0.1435 0.1535 0.2067 ...
 $ document_entropy: num  -0.667 -0.667 -0.667 -0.667 -0.667 ...
 $ is_service      : int  0 0 0 0 0 0 0 0 0 0 ...

The target column is binary and heavily imbalanced:

> table(dat$is_service)

   0    1 
6291  141 

Input columns are z-normalized.

[Two images showing the distributions of the input columns; not available in this copy.]

Interestingly, when I fit this model (first part of the code), the procedure finishes without any warnings.

However, when I run the contrasts with stats::anova (second part of the code), it does return warnings.

Question: Why does this happen, and which is more alarming: warnings from the single model fit, or from the anova analysis of it?

list.of.packages <- c('RCurl')
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

library(RCurl)

x <- getURL("https://rawgit.com/alexmosc/FX_Big_Experiment/master/service_train_saved.csv")
dat <- read.csv(text = x)
dat$X <- NULL

str(dat)
# first part: fit the full model with the interaction (finishes without warnings)
summary(
     glm(formula = is_service ~ action_count * document_entropy
         , family = binomial(link = 'logit'),
         data = dat
     )
)
# second part: compare the nested models term by term (returns warnings)
anova(
     glm(formula = is_service ~ 1
         , family = binomial(link = 'logit')
         , data = dat
     )
     , glm(formula = is_service ~ action_count
           , family = binomial(link = 'logit')
           , data = dat
     )
     , glm(formula = is_service ~ action_count + document_entropy
           , family = binomial(link = 'logit')
           , data = dat
     )
     , glm(formula = is_service ~ action_count + document_entropy + action_count:document_entropy
           , family = binomial(link = 'logit')
           , data = dat
     )
     , test = "Chisq"
)
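
One way to see which fit is responsible is to fit each nested model separately and record the warnings it raises. The following is only a minimal sketch: it uses simulated stand-in data (an assumption, not the real dataset, so it is not expected to reproduce the same warnings), and `fit_with_warnings` is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: fit one logistic glm and collect any warnings it raises.
fit_with_warnings <- function(formula, data) {
  warns <- character(0)
  fit <- withCallingHandlers(
    glm(formula, family = binomial(link = "logit"), data = data),
    warning = function(w) {
      # Store the message, then suppress the warning so it is not printed twice
      warns <<- c(warns, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  list(fit = fit, warnings = warns, converged = fit$converged)
}

# Simulated stand-in data (assumption): rare outcome, one heavy-tailed predictor
set.seed(1)
n <- 2000
sim <- data.frame(
  action_count     = as.numeric(scale(rlnorm(n))),
  document_entropy = as.numeric(scale(rnorm(n)))
)
sim$is_service <- rbinom(n, 1, plogis(-4 + 0.5 * sim$action_count))

# The same nested sequence that is passed to anova() above
formulas <- list(
  null  = is_service ~ 1,
  main1 = is_service ~ action_count,
  main2 = is_service ~ action_count + document_entropy,
  full  = is_service ~ action_count * document_entropy
)
results <- lapply(formulas, fit_with_warnings, data = sim)

# One row per model: how many warnings it raised and whether IRLS converged
data.frame(
  model      = names(results),
  n_warnings = sapply(results, function(r) length(r$warnings)),
  converged  = sapply(results, function(r) r$converged)
)
```

Replacing `sim` with `dat` would show, model by model, where the warnings in the second part originate.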
Alexey Burnakov
  • It would be easier to help if you provided a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data so we could run and test the code. – MrFlick Oct 23 '17 at 17:24
  • @MrFlick, thank you. I have updated my question. – Alexey Burnakov Oct 23 '17 at 17:47
  • I don't think this is really programming related. The warnings only come from the middle two models in your `anova()` list and don't occur when you fit the full model. This warning is discussed here: https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression. For better model suggestions this might be a better fit for [stats.se], since the problem is really statistical in nature. – MrFlick Oct 23 '17 at 19:29
  • Alright, I see there can be problems with convergence, as the data are unbalanced and heavy-tailed. It came as a surprise to me that I did not get warnings while fitting a single model with glm. I suppose this is more R related, but I might be wrong. Is the anova warning as alarming as a warning from the plain glm fit would be? Are you a moderator, and are you able to migrate my question to Cross Validated? Thank you for your answer. – Alexey Burnakov Oct 23 '17 at 20:31
  • I started the question at CV: https://stats.stackexchange.com/questions/309595/is-there-a-specific-reason-why-r-glm-does-not-return-warnings-while-anovaglm – Alexey Burnakov Oct 24 '17 at 09:09
