
I have a study with several cases, each containing data on multiple ordinal factor variables (genotypes) and multiple numeric variables (concentrations from various blood samples). I am trying to set up an exploratory model to test for linearity between any of the numeric variables (dependent in the model) and any of the ordinal factor variables (independent in the model).

Dataset structure example (independent variables): genotypes

case_id   genotype_1   genotype_2   ... genotype_n
1         0            0                1
2         1            0                2
...       ...          ...              ...
n         2            1                0

and dependent variables (with matching case IDs): samples

case_id   sample_1   sample_2   ... sample_n
1         0.3        0.12           6.12
2         0.25       0.15           5.66
...       ...        ...            ...
n         0.44       0.26           6.62

I found one similar example on the forum, but it doesn't solve the problem:

model <- apply(samples, 2, function(xl) lm(xl ~ ., data = genotypes))

I can't figure out how to run simple linear regressions covering every combination of a given set of dependent and independent variables. If I use the apply family, I guess the varying term should be the dependent variable, since every dependent variable should be tested for linearity against the same set of independent variables (individually).
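To make the goal concrete, this is what a single combination would look like (a minimal sketch, assuming the two tables are merged on `case_id` as in the extracts below); the question is how to repeat it over every `sample_i` / `genotype_j` pair without writing each model out by hand:

    # One combination only; the goal is to repeat this for every
    # sample_i / genotype_j pair.
    dat <- merge(samples, genotypes, by = "case_id")
    fit <- lm(sample_1 ~ genotype_1, data = dat)
    summary(fit)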

Extract from true data:

> genotypes

      case_id genotype_1 genotype_2 genotype_3 genotype_4 genotype_5
 1       1          2          2          1          1          0
 2       2        NaN          1        NaN          0          0
 3       3          1          0          0          0        NaN
 4       4          2          2          1          1          0
 5       5          0          0          0          1        NaN
 6       6          2          2          1          0          0
 7       9          0          0          0          0          1
 8      10          0          0          0        NaN          0
 9      13          0          0          0        NaN          0
10      15        NaN          1        NaN          0          1

> samples

   case_id    sample_1    sample_2     sample_3   sample_4    sample_5
 1       1  0.16092019  0.08814160 -0.087733372  0.1966070  0.09085343
 2       2 -0.21089678 -0.13289427  0.056583528 -0.9077926 -0.27928376
 3       3  0.05102400  0.07724300 -0.212567535  0.2485348  0.52406368
 4       4  0.04823619  0.12697286  0.010063683  0.2265085 -0.20257192
 5       5 -0.04841221 -0.10780329  0.005759269 -0.4092782  0.06212171
 6       6 -0.08926734 -0.19925538  0.202887833 -0.1536070 -0.05889369
 7       9 -0.03652588 -0.18442457  0.204140717  0.1176950 -0.65290133
 8      10  0.07038933  0.05797007  0.082702589  0.2927817  0.01149564
 9      13 -0.14082554  0.26783539 -0.316528107 -0.7226103 -0.16165326
10      15 -0.16650266 -0.35291579  0.010063683  0.5210507  0.04404433

SUMMARY: Since I have a lot of data I want to create a simple model to help me select which possible correlations to look further into. Any ideas out there?

NOTE: I am not trying to fit a multiple linear regression model!

  • Look at my answer here: https://stackoverflow.com/a/43941096/6118417 – Daniel Winkler Jul 19 '17 at 12:02
  • Also, I just noticed you are saying you want to test linearity. OLS does not test linearity but assumes it. – Daniel Winkler Jul 19 '17 at 12:47
  • Thank you for your answer! Unfortunately this doesn't solve the problem of having multiple independent AND dependent variables; I would still have to go through every dependent variable manually. And of course you are right about the technical side of testing vs. assuming linearity. However, one could use the assumption and look at how it performs as some kind of test, right? – andreasgoteson Jul 25 '17 at 07:28

1 Answer


I feel like there must be a statistical test for linearity, but I can't recall it; visual inspection is typically how I do it. A quick and dirty way to check linearity across a large number of variables would be to compute the correlation of each pair of dependent/independent variables. Small multiples would be a handy way to visualize them.
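A rough sketch of that, assuming the question's `samples` and `genotypes` data frames and treating the 0/1/2 genotypes as numeric scores, might look like:

    # Pairwise correlations between every sample column and every genotype
    # column; pairwise.complete.obs skips the NaN entries in genotypes.
    cor_mat <- cor(samples[-1], genotypes[-1], use = "pairwise.complete.obs")
    round(cor_mat, 2)  # rows = samples, columns = genotypes

From there you can sort the absolute correlations and pick the pairs worth a closer look.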

Alternatively, for each dependent ordinal variable, run a corrplot vs. each independent (numerical) variable, a logged version of the independent variable, and an exponentiated version of the independent variable. If the correlation for the logged or exponentiated version has a higher p-value than the regular version, it seems likely you have some linearity issues.
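A hedged sketch of that comparison (the function and variable names below are placeholders, not columns from the question, and `log()` of course requires positive values):

    # Correlation test of a response y against a predictor x on the raw,
    # log, and exponentiated scales, as described above.
    check_scales <- function(y, x) {
      list(
        raw = cor.test(y, x),
        log = cor.test(y, log(x)),
        exp = cor.test(y, exp(x))
      )
    }
    # p-values for each scale, e.g.:
    # sapply(check_scales(y, x), function(res) res$p.value)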

Mox