I have a linear regression like this:

lmGeneexp = lm(gene_expression ~ (pos1 + pos2 +  pos3), data = donor_snp_sample) 

summary(lmGeneexp)

When I run this code, this is the result:

Coefficients: (2 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    5.708     64.905   0.088    0.930
pos11        -25.853    436.678  -0.059    0.953
pos12        -48.653    443.310  -0.110    0.913
pos21         25.960    416.159   0.062    0.950
pos22             NA         NA      NA       NA
pos31         24.269    117.284   0.207    0.836
pos32             NA         NA      NA       NA

I can't understand why there are two coefficients for each "pos#" variable. For example, for the variable "pos1" the result contains "pos11" and "pos12". What is the problem with my code or my data?

Thanks a lot.

This is an example of my data:

       pos1 pos2  pos3  gene_expression
row1    0    0     1          7.4
row2    0    0     2          8.5
row3    0    0     1          6.3
row4    1    0     2          3.5
row5    2    0     0          2.1
row6    1    0     0          7.4
...           
MMRA
  • 3
    Can you give an example of your data? – lil_barnacle Mar 07 '21 at 05:50
  • 3
    In particular we need to know the levels of the `pos1`, `pos2`, `pos3` variables (which based on your output are factors). `levels()` or `table()` for each of those variables would be helpful. There is a strong possibility that you have blanks or something similar in your data that lead to (1) data being imported as factors rather than numeric, (2) an extra level in your factors that throws off your regression. – Ben Bolker Mar 07 '21 at 05:54
  • @lil_barnacle I added an example of my data to question – MMRA Mar 07 '21 at 05:55
  • 1
    As @BenBolker mentioned, `pos1`, `pos2` and `pos3` have 3 levels (0, 1 and 2). If you want only 1 coefficient for each predictor in the regression, you can convert them to numeric using `as.numeric()` or remove 1 level (e.g., assign 0 to be NA). – lil_barnacle Mar 07 '21 at 06:04

1 Answer

It looks like your pos1, pos2, etc. variables are coded as factors, so they are treated as categorical variables in your regression. In this case the "0" value is treated as the reference level for each of these variables, and a separate coefficient is estimated for each other level relative to that reference level. This is standard reference-level encoding for categorical variables. The extra coefficients aren't "unnecessary": they specify the effect of each level.
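To make the reference-level encoding concrete, here is a minimal sketch using a made-up three-row data frame (the `toy` object is invented for illustration and is not the asker's data). It shows the dummy columns that R builds for a factor with levels 0, 1 and 2 under the default treatment contrasts:

```r
# Toy data, invented for illustration: one factor with levels "0", "1", "2".
toy <- data.frame(pos1 = factor(c(0, 1, 2)))

# Build the design matrix that lm() would use for this predictor.
X <- model.matrix(~ pos1, data = toy)

colnames(X)  # "(Intercept)" "pos11" "pos12"
X            # level "0" has zeros in both dummy columns: it is the reference
```

Level "0" gets no column of its own because it is absorbed into the intercept; the columns pos11 and pos12 switch on only for levels "1" and "2".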

It looks a bit odd because your variable names end in numbers and your factor levels are also numbers. So the value pos11 = -25.853 actually means that the estimated mean for observations with a value of "1" for "pos1" (variable "pos1" + value "1" = "pos11") is about 26 units lower than for those with a value of "0" for "pos1", with the other variables held fixed. You can think of the names as

pos11 => pos1_1_vs_0
pos12 => pos1_2_vs_0
pos21 => pos2_1_vs_0
pos22 => pos2_2_vs_0
pos31 => pos3_1_vs_0
pos32 => pos3_2_vs_0
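The naming pattern can be reproduced on simulated data. The sketch below uses invented values (the `toy` and `toy_num` data frames and the random response are assumptions for illustration only) and compares the coefficient names when the predictors are factors with the names when the same predictors are numeric:

```r
# Simulated data, invented for illustration (not the asker's data set).
set.seed(1)
toy <- data.frame(
  pos1 = factor(rep(0:2, each = 10)),
  pos2 = factor(rep(0:2, times = 10)),
  gene_expression = rnorm(30, mean = 6)
)

# Factors: one coefficient per non-reference level, named <variable><level>.
fit_factor <- lm(gene_expression ~ pos1 + pos2, data = toy)
names(coef(fit_factor))

# Same predictors stored as numbers: one slope per variable.
toy_num <- transform(
  toy,
  pos1 = as.numeric(as.character(pos1)),
  pos2 = as.numeric(as.character(pos2))
)
fit_numeric <- lm(gene_expression ~ pos1 + pos2, data = toy_num)
names(coef(fit_numeric))
```

Running `sapply(toy, is.factor)` on your own data frame is a quick way to check which of the two situations you are in.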

If you did not intend to treat those values as categorical variables, be sure to investigate how the conversion to factor happened. Normally R will read in numeric values as numbers. The lm function will automatically convert characters to factors, so if you want the values to be numeric, make sure they aren't read in as characters. If you do need to convert factor values to numeric before regression, be careful: calling as.numeric() directly on a factor returns the internal level codes, not the original values. Here's a helper function that does the conversion properly.

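# Convert a factor whose level labels are numbers back to those numeric values.
factor_to_numeric <- function(x) {
  stopifnot(is.factor(x))
  # Index the numeric level labels by the factor's integer codes.
  as.numeric(levels(x))[x]
}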
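As a quick usage sketch (the vector `f` is made up for illustration), compare the helper with a direct as.numeric() call on the same factor:

```r
# Same helper as above, repeated so this snippet runs on its own.
factor_to_numeric <- function(x) {
  stopifnot(is.factor(x))
  as.numeric(levels(x))[x]
}

# Made-up factor with levels "0", "1", "2".
f <- factor(c("0", "2", "1", "2"))

codes  <- as.numeric(f)         # 1 3 2 3 : internal level codes
values <- factor_to_numeric(f)  # 0 2 1 2 : the original values
```

The direct call shifts every value by one here, and it would scramble the values entirely for levels that are not consecutive integers starting at 1.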
MrFlick
  • 1
    ideally OP should go back and figure out *why* their data got read as factors (and stop it from happening/get them read in correctly), rather than converting the columns back to numeric afterward – Ben Bolker Mar 07 '21 at 06:12
  • 1
    I think it makes sense to use factors or ordered factors depending on the context +1 – Vons Mar 07 '21 at 06:14
  • 2
    @BenBolker True, if that was the intention. I would guess that they were correctly coded as factors to be treated as categorical given that they do take only three distinct values. I've added some clarification. I think the real problem was just the potentially confusing names. – MrFlick Mar 07 '21 at 06:15