I understand it's a well-known classic problem, but despite my research I failed to fixed this problem with my data.
I have this kind of data:
df
SNP Site Color Frequence
1 scaffold10000|size69197_10061 K Green 0.4404348
2 scaffold10000|size69197_10061 G Green 0.6700000
3 scaffold10000|size69197_10061 G Red 0.7171429
4 scaffold10000|size69197_10061 K Yellow 0.7937500
5 scaffold10000|size69197_10061 T Yellow 0.7202174
6 scaffold10000|size69197_10061 E Red 0.7373469
7 scaffold10000|size69197_10061 G Yellow 0.6150000
8 scaffold10000|size69197_10061 T Red 0.5668750
9 scaffold10000|size69197_10061 K Red 0.6190385
10 scaffold10000|size69197_10061 T Green 0.5629412
11 scaffold10000|size69197_10061 E Yellow 0.8312500
12 scaffold10000|size69197_10061 E Green 0.5474286
And I want to known if there are statistical differences between the three colours and the four sites for this SNP (named "scaffold10000|size69197_10061").
I want to ponderate these variables (the 3 colours and the 4 sites), this is why I choose glm()
function.
model <- glm(formula = Frequence ~ Color + Site, family=quasibinomial(), data=df)
And this give me these coefficients:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2905 0.3105 0.936 0.3856
ColorRed 0.4450 0.3084 1.443 0.1991
ColorYellow 0.8298 0.3215 2.581 0.0417 *
SiteT -0.2268 0.3644 -0.622 0.5566
SiteK -0.2221 0.3645 -0.609 0.5646
SiteE 0.1809 0.3760 0.481 0.6475
---
So the Green and the G site doesn't appears (because both are categorical if I understood correctly).
According to these issues in R blogger and in Stackoverflow I understand how to remove the intercept (to get a model easier to understand) in adding -1
or + 0
in the formula.
model <- glm(formula = Frequence ~ Color + Site - 1, family=quasibinomial(), data=df)
So I at least one categorical variable appears:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
ColorGreen 0.2905 0.3105 0.936 0.3856
ColorRed 0.7355 0.3185 2.309 0.0603 .
ColorYellow 1.1202 0.3319 3.376 0.0149 *
SiteT -0.2268 0.3644 -0.622 0.5566
SiteK -0.2221 0.3645 -0.609 0.5646
SiteE 0.1809 0.3760 0.481 0.6475
I failed to encode something in order to the 4th site appears
Firstly I try to merge 2 different models:
model1 <- glm(formula = Frequence ~ Site - 1, family=quasibinomial(), data=df)
model2 <- glm(formula = Frequence ~ Color - 1, family=quasibinomial(), data=df)
with different ways but didn't work (and may didn't make sense..)
Put others -1
or + 0
didn't work neither:
model <- glm(formula = Frequence ~ 0 + Color + Site - 1, family=quasibinomial(), data=df)
According the answer to this similar issue (and also this about lm()
, simply add two sum-to-zero constrains on parameters:
contrasts(ok$Site) <- contr.sum(4, contrasts=F)
contrasts(ok$Color) <- contr.sum(3, contrasts=F)
or use this (I don't remember on each tread on Stackoverflow)
relevel(ok$Site, "E")
relevel(ok$Site, "T")
relevel(ok$Site, "K")
relevel(ok$Site, "G")
and re-run the model. But these two possibilities also failed.
So I try to split the data.frame
in order to had manually the variables into the model:
df2
SNP Site Color Frequence Green Yellow Red K G T E
1 scaffold10000|size69197_10061 K Green 0.4404348 TRUE FALSE FALSE TRUE FALSE FALSE FALSE
2 scaffold10000|size69197_10061 G Green 0.6700000 TRUE FALSE FALSE FALSE TRUE FALSE FALSE
3 scaffold10000|size69197_10061 G Red 0.7171429 FALSE FALSE TRUE FALSE TRUE FALSE FALSE
4 scaffold10000|size69197_10061 K Yellow 0.7937500 FALSE TRUE FALSE TRUE FALSE FALSE FALSE
5 scaffold10000|size69197_10061 T Yellow 0.7202174 FALSE TRUE FALSE FALSE FALSE TRUE FALSE
6 scaffold10000|size69197_10061 E Red 0.7373469 FALSE FALSE TRUE FALSE FALSE FALSE TRUE
7 scaffold10000|size69197_10061 G Yellow 0.6150000 FALSE TRUE FALSE FALSE TRUE FALSE FALSE
8 scaffold10000|size69197_10061 T Red 0.5668750 FALSE FALSE TRUE FALSE FALSE TRUE FALSE
9 scaffold10000|size69197_10061 K Red 0.6190385 FALSE FALSE TRUE TRUE FALSE FALSE FALSE
10 scaffold10000|size69197_10061 T Green 0.5629412 TRUE FALSE FALSE FALSE FALSE TRUE FALSE
11 scaffold10000|size69197_10061 E Yellow 0.8312500 FALSE TRUE FALSE FALSE FALSE FALSE TRUE
12 scaffold10000|size69197_10061 E Green 0.5474286 TRUE FALSE FALSE FALSE FALSE FALSE TRUE
(The TRUE and FALSE can be change into 0 and 1 with df2[df2=="FALSE"]<-0
.
model <- glm(formula=Frequence ~ Red + Green + Yellow + K + T + E + G -1, family=quasibinomial(), data=df2)
Now, all variables are in coefficient:
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
RedFALSE 1.1202 0.3319 3.376 0.0149 *
RedTRUE 0.7355 0.3185 2.309 0.0603 .
GreenTRUE -0.8298 0.3215 -2.581 0.0417 *
YellowTRUE NA NA NA NA
KTRUE -0.2221 0.3645 -0.609 0.5646
TTRUE -0.2268 0.3644 -0.622 0.5566
ETRUE 0.1809 0.3760 0.481 0.6475
GTRUE NA NA NA NA
But NA
appears now.
According to this issue in Stackexchange, I checked if the model matrix has full rank, and the answer is no.
# Get model matrix ...
X <- model.matrix(~ Red + Green + Yellow + K + T + E + G - 1, family=quasibinomial(), data=as.data.frame(ok))
> X
RedFALSE RedTRUE GreenTRUE YellowTRUE KTRUE TTRUE ETRUE GTRUE
1 1 0 1 0 1 0 0 0
2 1 0 1 0 0 0 0 1
3 0 1 0 0 0 0 0 1
4 1 0 0 1 1 0 0 0
5 1 0 0 1 0 1 0 0
6 0 1 0 0 0 0 1 0
7 1 0 0 1 0 0 0 1
8 0 1 0 0 0 1 0 0
9 0 1 0 0 1 0 0 0
10 1 0 1 0 0 1 0 0
11 1 0 0 1 0 0 1 0
12 1 0 1 0 0 0 1 0
# Get rank of model matrix
qr(X)$rank
> 6
# Get number of parameters of the model = number of columns of model matrix
ncol(X)
> 8
So if there is no -1
, the first column of X
is the intercept and if there is the -1
the Red column is duplicate (one for TRUE and one for FALSE).
So, there is 8 column and 6 rank. Normally I should have had rather 14 columns and 14 rank no ? (7 variables (3 colours and 4 sites) * 2 (TRUE or FALSE))
So, how can I encode my model in order to force obtaining Pvalues for all variables ?
Any advices for programming this in a proper way these will be very appreciated.