1

I understand it's a well-known classic problem, but despite my research I failed to fixed this problem with my data.

I have this kind of data:

df

                         SNP Site  Color Frequence
1  scaffold10000|size69197_10061    K  Green 0.4404348
2  scaffold10000|size69197_10061    G  Green 0.6700000
3  scaffold10000|size69197_10061    G    Red 0.7171429
4  scaffold10000|size69197_10061    K Yellow 0.7937500
5  scaffold10000|size69197_10061    T Yellow 0.7202174
6  scaffold10000|size69197_10061    E    Red 0.7373469
7  scaffold10000|size69197_10061    G Yellow 0.6150000
8  scaffold10000|size69197_10061    T    Red 0.5668750
9  scaffold10000|size69197_10061    K    Red 0.6190385
10 scaffold10000|size69197_10061    T  Green 0.5629412
11 scaffold10000|size69197_10061    E Yellow 0.8312500
12 scaffold10000|size69197_10061    E  Green 0.5474286

And I want to known if there are statistical differences between the three colours and the four sites for this SNP (named "scaffold10000|size69197_10061"). I want to ponderate these variables (the 3 colours and the 4 sites), this is why I choose glm() function.

model <- glm(formula = Frequence  ~  Color  + Site, family=quasibinomial(), data=df) 

And this give me these coefficients:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   0.2905     0.3105   0.936   0.3856  
ColorRed      0.4450     0.3084   1.443   0.1991  
ColorYellow   0.8298     0.3215   2.581   0.0417 *
SiteT        -0.2268     0.3644  -0.622   0.5566  
SiteK        -0.2221     0.3645  -0.609   0.5646  
SiteE         0.1809     0.3760   0.481   0.6475  
---

So the Green and the G site doesn't appears (because both are categorical if I understood correctly).

According to these issues in R blogger and in Stackoverflow I understand how to remove the intercept (to get a model easier to understand) in adding -1 or + 0 in the formula.

model <- glm(formula = Frequence  ~  Color  + Site - 1, family=quasibinomial(), data=df) 

So I at least one categorical variable appears:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
ColorGreen    0.2905     0.3105   0.936   0.3856  
ColorRed      0.7355     0.3185   2.309   0.0603 .
ColorYellow   1.1202     0.3319   3.376   0.0149 *
SiteT        -0.2268     0.3644  -0.622   0.5566  
SiteK        -0.2221     0.3645  -0.609   0.5646  
SiteE         0.1809     0.3760   0.481   0.6475  

I failed to encode something in order to the 4th site appears

Firstly I try to merge 2 different models:

model1 <- glm(formula = Frequence  ~  Site - 1, family=quasibinomial(), data=df) 
model2 <- glm(formula = Frequence  ~  Color - 1, family=quasibinomial(), data=df) 

with different ways but didn't work (and may didn't make sense..)

Put others -1 or + 0 didn't work neither:

model <- glm(formula = Frequence  ~  0 + Color  + Site - 1, family=quasibinomial(), data=df) 

According the answer to this similar issue (and also this about lm(), simply add two sum-to-zero constrains on parameters:

contrasts(ok$Site) <- contr.sum(4, contrasts=F)
contrasts(ok$Color) <- contr.sum(3, contrasts=F)

or use this (I don't remember on each tread on Stackoverflow)

relevel(ok$Site, "E")
relevel(ok$Site, "T")
relevel(ok$Site, "K")
relevel(ok$Site, "G")

and re-run the model. But these two possibilities also failed.

So I try to split the data.frame in order to had manually the variables into the model:

df2
                              SNP Site  Color Frequence Green Yellow   Red     K     G     T     E
 1  scaffold10000|size69197_10061    K  Green 0.4404348  TRUE  FALSE FALSE  TRUE FALSE FALSE FALSE
 2  scaffold10000|size69197_10061    G  Green 0.6700000  TRUE  FALSE FALSE FALSE  TRUE FALSE FALSE
 3  scaffold10000|size69197_10061    G    Red 0.7171429 FALSE  FALSE  TRUE FALSE  TRUE FALSE FALSE
 4  scaffold10000|size69197_10061    K Yellow 0.7937500 FALSE   TRUE FALSE  TRUE FALSE FALSE FALSE
 5  scaffold10000|size69197_10061    T Yellow 0.7202174 FALSE   TRUE FALSE FALSE FALSE  TRUE FALSE
 6  scaffold10000|size69197_10061    E    Red 0.7373469 FALSE  FALSE  TRUE FALSE FALSE FALSE  TRUE
 7  scaffold10000|size69197_10061    G Yellow 0.6150000 FALSE   TRUE FALSE FALSE  TRUE FALSE FALSE
 8  scaffold10000|size69197_10061    T    Red 0.5668750 FALSE  FALSE  TRUE FALSE FALSE  TRUE FALSE
 9  scaffold10000|size69197_10061    K    Red 0.6190385 FALSE  FALSE  TRUE  TRUE FALSE FALSE FALSE
 10 scaffold10000|size69197_10061    T  Green 0.5629412  TRUE  FALSE FALSE FALSE FALSE  TRUE FALSE
 11 scaffold10000|size69197_10061    E Yellow 0.8312500 FALSE   TRUE FALSE FALSE FALSE FALSE  TRUE
 12 scaffold10000|size69197_10061    E  Green 0.5474286  TRUE  FALSE FALSE FALSE FALSE FALSE  TRUE

(The TRUE and FALSE can be change into 0 and 1 with df2[df2=="FALSE"]<-0.

  model <- glm(formula=Frequence  ~  Red + Green + Yellow + K + T + E + G -1, family=quasibinomial(), data=df2)

Now, all variables are in coefficient:

Coefficients: (2 not defined because of singularities)
           Estimate Std. Error t value Pr(>|t|)  
RedFALSE     1.1202     0.3319   3.376   0.0149 *
RedTRUE      0.7355     0.3185   2.309   0.0603 .
GreenTRUE   -0.8298     0.3215  -2.581   0.0417 *
YellowTRUE       NA         NA      NA       NA  
KTRUE       -0.2221     0.3645  -0.609   0.5646  
TTRUE       -0.2268     0.3644  -0.622   0.5566  
ETRUE        0.1809     0.3760   0.481   0.6475  
GTRUE            NA         NA      NA       NA 

But NA appears now.

According to this issue in Stackexchange, I checked if the model matrix has full rank, and the answer is no.

# Get model matrix ...
X <- model.matrix(~ Red + Green + Yellow + K  + T + E + G - 1, family=quasibinomial(), data=as.data.frame(ok))
> X
   RedFALSE RedTRUE GreenTRUE YellowTRUE KTRUE TTRUE ETRUE GTRUE
1         1       0         1          0     1     0     0     0
2         1       0         1          0     0     0     0     1
3         0       1         0          0     0     0     0     1
4         1       0         0          1     1     0     0     0
5         1       0         0          1     0     1     0     0
6         0       1         0          0     0     0     1     0
7         1       0         0          1     0     0     0     1
8         0       1         0          0     0     1     0     0
9         0       1         0          0     1     0     0     0
10        1       0         1          0     0     1     0     0
11        1       0         0          1     0     0     1     0
12        1       0         1          0     0     0     1     0


# Get rank of model matrix
qr(X)$rank
> 6


# Get number of parameters of the model = number of columns of model matrix
ncol(X)
> 8

So if there is no -1, the first column of X is the intercept and if there is the -1 the Red column is duplicate (one for TRUE and one for FALSE).

So, there is 8 column and 6 rank. Normally I should have had rather 14 columns and 14 rank no ? (7 variables (3 colours and 4 sites) * 2 (TRUE or FALSE))

So, how can I encode my model in order to force obtaining Pvalues for all variables ?

Any advices for programming this in a proper way these will be very appreciated.

  • 4
    Do not remove the intercept (that's almost never a good idea). Use package multcomp to compare the groups. See 2nd chapter of the vignette: https://cran.r-project.org/web/packages/multcomp/vignettes/multcomp-examples.pdf – Roland Mar 13 '19 at 09:48
  • 1
    if you have instead of the variable `color` three separated variables (0, 1 values) then you'll have high correlations between them (note that in this example `yellow = +(green + red != 1)`), which will cause `glm` to produce `NA`s for at least one of them. Adding all of the groups is not the solution to what you are trying to do, instead find an alternative way to evaluate the groups like suggested in the previous comment by @Roland – DS_UNI Mar 13 '19 at 10:28
  • Thanks a lot for our advices ! It help me a lot to understand my problem – Pierre-louis Stenger Mar 26 '19 at 10:15

0 Answers0