
I'm fairly confused as to why I produce a regression equation that sits so far outside the range of all the data in the dataset. I have a feeling the equation is very sensitive to data with a big spread, but I'm still confused. Any assistance would be greatly appreciated; stats certainly isn't my first language!

For reference, this is a geochemical thermodynamics problem: I'm trying to fit the Maier-Kelley equation to some experimental data. The Maier-Kelley equation describes how the equilibrium constant (K), in this case for dolomite dissolving in water, changes with temperature (T, here in Kelvin).

log K = A + B*T + C/T + D*log10(T) + E/T^2

To cut a long story short (see Hyeong and Capuano, 2001 if interested), log K here is equivalent to Log_Ca_Mg (the log ratio of calcium to magnesium ion activities).

The experimental data are groundwater data from different locations and different depths (identified by FIELD and DepthID, which are my grouping variables for the random effects).

I have included 3 datasets:

(Problem) Dataset 1: https://pastebin.com/fe2r2ebA

(Working) Dataset 2: https://pastebin.com/gFgaJ2c8

(Working) Dataset 3: https://pastebin.com/X5USaaNA

Using the following code for Dataset 1:

> dat1 <- read.csv("PATH_TO_DATASET_1.txt", header = TRUE,sep="\t")
> fm1 <- lmer(Log_Ca_Mg ~ 1 + kelvin + I(kelvin^-1) + I(log10(kelvin)) + I(kelvin^-2) + (1|FIELD) +(1|DepthID),data=dat1)

Warning messages:
1: Some predictor variables are on very different scales: consider rescaling 
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.0196619 (tol = 0.002, component 1)
3: Some predictor variables are on very different scales: consider rescaling

> summary(fm1)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: Log_Ca_Mg ~ 1 + kelvin + I(kelvin^-1) + I(log10(kelvin)) + I(kelvin^-2) +      (1 | FIELD) + (1 | DepthID)
   Data: dat1

REML criterion at convergence: -774.7

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.5464 -0.4538 -0.0671  0.3736  6.4217 

Random effects:
 Groups   Name        Variance Std.Dev.
 DepthID  (Intercept) 0.01035  0.1017  
 FIELD    (Intercept) 0.01081  0.1040  
 Residual             0.01905  0.1380  
Number of obs: 1175, groups:  DepthID, 675; FIELD, 410

Fixed effects:
                   Estimate Std. Error         df t value Pr(>|t|)
(Intercept)       3.368e+03  1.706e+03  4.582e-02   1.974    0.876
kelvin            4.615e-01  2.375e-01  4.600e-02   1.943    0.876
I(kelvin^-1)     -1.975e+05  9.788e+04  4.591e-02  -2.018    0.875
I(log10(kelvin)) -1.205e+03  6.122e+02  4.582e-02  -1.968    0.876
I(kelvin^-2)      1.230e+07  5.933e+06  4.624e-02   2.073    0.873

Correlation of Fixed Effects:
            (Intr) kelvin I(^-1) I(10()
kelvin       0.999                     
I(kelvn^-1) -1.000 -0.997              
I(lg10(kl)) -1.000 -0.999  0.999       
I(kelvn^-2)  0.998  0.994 -0.999 -0.997
fit warnings:
Some predictor variables are on very different scales: consider rescaling
convergence code: 0
Model failed to converge with max|grad| = 0.0196619 (tol = 0.002, component 1)

For Dataset 2

> summary(fm2)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: Log_Ca_Mg ~ 1 + kelvin + I(kelvin^-1) + I(log10(kelvin)) + I(kelvin^-2) +      (1 | FIELD) + (1 | DepthID)
   Data: dat2

REML criterion at convergence: -1073.8

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.0816 -0.4772 -0.0581  0.3650  5.6209 

Random effects:
 Groups   Name        Variance Std.Dev.
 DepthID  (Intercept) 0.007368 0.08584 
 FIELD    (Intercept) 0.014266 0.11944 
 Residual             0.023048 0.15182 
Number of obs: 1906, groups:  DepthID, 966; FIELD, 537

Fixed effects:
                   Estimate Std. Error         df t value Pr(>|t|)
(Intercept)      -9.366e+01  2.948e+03  1.283e-03  -0.032    0.999
kelvin           -2.798e-02  4.371e-01  1.289e-03  -0.064    0.998
I(kelvin^-1)      2.623e+02  1.627e+05  1.285e-03   0.002    1.000
I(log10(kelvin))  3.965e+01  1.067e+03  1.283e-03   0.037    0.999
I(kelvin^-2)      2.917e+05  9.476e+06  1.294e-03   0.031    0.999

Correlation of Fixed Effects:
            (Intr) kelvin I(^-1) I(10()
kelvin       0.999                     
I(kelvn^-1) -0.999 -0.997              
I(lg10(kl)) -1.000 -0.999  0.999       
I(kelvn^-2)  0.998  0.994 -0.999 -0.997
fit warnings:
Some predictor variables are on very different scales: consider rescaling
convergence code: 0
Model failed to converge with max|grad| = 0.0196967 (tol = 0.002, component 1)

For Dataset 3

> summary(fm2)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: Log_Ca_Mg ~ 1 + kelvin + I(kelvin^-1) + I(log10(kelvin)) + I(kelvin^-2) +      (1 | FIELD) + (1 | DepthID)
   Data: dat3

REML criterion at convergence: -1590.1

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.2546 -0.4987 -0.0379  0.4313  4.5490 

Random effects:
 Groups   Name        Variance Std.Dev.
 DepthID  (Intercept) 0.01311  0.1145  
 FIELD    (Intercept) 0.01424  0.1193  
 Residual             0.03138  0.1771  
Number of obs: 6674, groups:  DepthID, 3422; FIELD, 1622

Fixed effects:
                   Estimate Std. Error         df t value Pr(>|t|)
(Intercept)       1.260e+03  1.835e+03  9.027e-02   0.687    0.871
kelvin            1.824e-01  2.783e-01  9.059e-02   0.655    0.874
I(kelvin^-1)     -7.289e+04  9.961e+04  9.044e-02  -0.732    0.866
I(log10(kelvin)) -4.529e+02  6.658e+02  9.028e-02  -0.680    0.872
I(kelvin^-2)      4.499e+06  5.690e+06  9.104e-02   0.791    0.860

Correlation of Fixed Effects:
            (Intr) kelvin I(^-1) I(10()
kelvin       0.999                     
I(kelvn^-1) -1.000 -0.997              
I(lg10(kl)) -1.000 -0.999  0.999       
I(kelvn^-2)  0.998  0.994 -0.999 -0.998
fit warnings:
Some predictor variables are on very different scales: consider rescaling
convergence code: 0
unable to evaluate scaled gradient
Model failed to converge: degenerate  Hessian with 1 negative eigenvalues

I've plotted 'all the data', but for the regression analysis there is no data above the red line or below the green line. Only points with a Log_Ca_Mg value between the red and green lines at any temperature are included in the regression analysis.

[Plot: Log_Ca_Mg vs. temperature for all data, with the red and green lines bounding the points used in the regression]

So, looking at the regressions on a plot, Dataset 1 is just way off, but since there is no data above the red line this confuses me no end. The regression sits in a region where there is no data at all. For the other two datasets this isn't a problem. Even for smaller datasets (n = 200) the fitted curve lands in approximately the same area. The three datasets look relatively similar when plotted individually.
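
For reference, this is roughly how I draw the fitted curve: I plug the fixed-effects coefficients from the lmer fit back into the Maier-Kelley form (a sketch; this isn't the exact plotting code I used):

# Sketch: overlay the fitted fixed-effects curve on the Dataset 1 points
b <- lme4::fixef(fm1)   # order: intercept, kelvin, kelvin^-1, log10(kelvin), kelvin^-2
Tseq <- seq(min(dat1$kelvin), max(dat1$kelvin), length.out = 200)
pred <- b[1] + b[2] * Tseq + b[3] / Tseq + b[4] * log10(Tseq) + b[5] / Tseq^2

plot(dat1$kelvin, dat1$Log_Ca_Mg, xlab = "Temperature (K)", ylab = "Log_Ca_Mg")
lines(Tseq, pred, col = "red", lwd = 2)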

I'm kind of lost. Any help in understanding this would be appreciated.

Hamish Robertson
  • I'm not too sure I understand you exactly. The regression model spits out Log_Ca_Mg as a function of temperature (in Kelvin). I then plot this regression model, which describes the relationship between Log_Ca_Mg and temperature. To me it seems like no matter how bad the fit is, it should still plot somewhere between the red and green lines where all of the data is contained. At the moment it plots outside the range of all of the data – Hamish Robertson Feb 22 '20 at 17:09
  • Forgive my earlier comment. It was half-baked. But looking at your code, I think it is notable that you are seeing warnings that the "Model failed to converge". Maybe this https://stats.stackexchange.com/questions/242109/model-failed-to-converge-warning-in-lmer might help. Also, the warning about the predictors being on very different scales might also be something to consider. Have a look at gung's answer to this post https://stats.stackexchange.com/questions/29781/when-conducting-multiple-regression-when-should-you-center-your-predictor-varia – xilliam Feb 22 '20 at 20:15
  • Yes, I thought that with the failure to converge the model would be 'wrong' or not great, but not so wildly outside of the range. I've looked at both questions but am really not particularly great with this stuff. Do I need to center each term of the regression per gung? "Say you have a variable, x, that ranges from 1 to 2, but you suspect a curvilinear relationship with the response variable, and so you want to create an x^2 term. If you don't center first, your squared term will be highly correlated with x, which could muddy the estimation of the beta. Centering first addresses this issue." – Hamish Robertson Feb 24 '20 at 15:30
  • Are you having a problem with the coding or the actual statistics behind the code? If it is the model itself I would ask your question on Cross Validated. But to your question: are you at all worried about the extreme correlation between all of your predictors? The lack of convergence is an indication that the predicted values from your model will be way off from your actual values. – Mike Feb 24 '20 at 16:14
  • Hi Mike, I guess it's a combination of not being too sure how to address the issues in R, but more so not being familiar with or good at the statistics. Are you saying the high correlation between all the predictors is a very bad thing, in a similar way to the questions above? If so, do you have any suggestions on how to reduce this high correlation between terms when coding the model? Yes, I probably should have asked on Cross Validated; I wasn't quite sure which one was more appropriate. – Hamish Robertson Feb 24 '20 at 23:42
  • Hi Hamish, I think the correlation issue suggests there may be a problem with 'collinearity', which is a situation where one predictor itself predicts another. Collinearity tends to make models unstable, meaning that small changes in inputs lead to big changes in the output of the model. Can you describe why the linear model contains all the transformations of the variable kelvin? By that I mean kelvin^-1, log10(kelvin) and kelvin^-2. Do you have some rationale for including them? If there is not a sound case to include them, it might be feasible to drop some of them. – xilliam Feb 25 '20 at 07:05
  • Have you tried rescaling as suggested by the warning? e.g., `Log_Ca_Mg ~ 1 + kelvin + I(100 * kelvin^-1) + I(log10(kelvin)) + I(1e4 * kelvin^-2) + (1 | FIELD) + (1 | DepthID)` – Roland Feb 25 '20 at 10:26
  • Hi Roland. The five terms of the model are essentially immediately usable as other thermodynamic parameters (see aqion.de/site/121 if interested; it's in the bottom section). I'm wondering whether rescaling affects the validity of the regression equation. I don't think so, right? You are just effectively changing the 'emphasis' on different coefficients. – Hamish Robertson Feb 25 '20 at 11:42
  • @HamishRobertson You'd just change the unit of the coefficients. – Roland Mar 02 '20 at 13:09

2 Answers


What follows is an attempt to diagnose what may be going wrong with your model, using Dataset 1 for the discussion.

As described in your question, when one runs the original model with Dataset 1, one receives warnings:

# original model
fm1 <- lme4::lmer(Log_Ca_Mg ~ 1 + kelvin + I(kelvin^-1) + I(log10(kelvin)) + I(kelvin^-2) + (1|FIELD) +(1|DepthID),data=dat1)

Some predictor variables are on very different scales: consider rescaling
convergence code: 0
Model failed to converge with max|grad| = 0.0196619 (tol = 0.002, component 1)

These warnings and the other output suggest your model has problems, perhaps related to the predictors being on very different scales.

Since fm1 has several predictors that are transformations of the variable 'kelvin', we can also check the model for collinearity with the car package's vif function:

# examine collinearity with the vif (variance inflation factors)
> car::vif(fm1)
kelvin     I(kelvin^-1) I(log10(kelvin))     I(kelvin^-2) 
716333          9200929          7688348          1224275 

These vif values suggest the fm1 model suffers from high collinearity.

We can try dropping some of those predictors to examine a simpler model:

fm1_b <- lme4::lmer(Log_Ca_Mg ~ 1 + kelvin + I(kelvin^-1) + (1|FIELD) +(1|DepthID),data=dat1)

When we run the code, we still get a warning about the predictors being on different scales:

Warning message: Some predictor variables are on very different scales: consider rescaling

At the same time the vif values are much smaller:

# examine collinearity with the vif (variance inflation factors)
  > car::vif(fm1_b)
kelvin I(kelvin^-1) 
46.48406     46.48406 

Following gung's suggestion that I mentioned in the comments, we can see what happens when we center our kelvin variables:

dat1$kelvin_centered <- as.vector(scale(dat1$kelvin, center = TRUE, scale = FALSE))
# Make a power transformation of the kelvin_centered variable
dat1$kelvin_centered_pwr <- dat1$kelvin_centered^-1

And check to see if they are correlated

# check the correlation of the centered vars
> cor(dat1$kelvin_centered, dat1$kelvin_centered_pwr)
[1] 0.08056641

And construct a different model with the centered variables:

# construct a modifed model
fm1_c <- lme4::lmer(Log_Ca_Mg ~ 1 + kelvin_centered + kelvin_centered_pwr + (1|FIELD) +(1|DepthID),data=dat1)

Notably, we don't see any warnings when we run the code with this model, and the vif values are quite low:

> car::vif(fm1_c)
    kelvin_centered kelvin_centered_pwr 
           1.005899            1.005899 

Conclusion

The original model has a high degree of collinearity. Collinearity can make models unstable, which could account for why fm1 failed to converge and why you are seeing strange predictions in the plots. Model fm1_c may or may not be the correct model for your purposes, but it at least provides a lens for understanding the issue with your original model.
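
As an aside, if you need to keep all five Maier-Kelley terms, one option (just a sketch, not tested against the linked data) is to standardize each transformed temperature term before fitting and then convert the coefficients back to the original scale afterwards. This addresses the 'very different scales' warning, though the near-perfect correlations between the temperature terms remain, so convergence may still be fragile:

# Sketch: fit the full five-term model on standardized predictors,
# then recover coefficients on the original Maier-Kelley scale.
dat1$x1 <- dat1$kelvin
dat1$x2 <- 1 / dat1$kelvin
dat1$x3 <- log10(dat1$kelvin)
dat1$x4 <- 1 / dat1$kelvin^2

mns <- sapply(dat1[c("x1", "x2", "x3", "x4")], mean)
sds <- sapply(dat1[c("x1", "x2", "x3", "x4")], sd)
dat1[c("z1", "z2", "z3", "z4")] <- scale(dat1[c("x1", "x2", "x3", "x4")])

fm1_z <- lme4::lmer(Log_Ca_Mg ~ z1 + z2 + z3 + z4 + (1 | FIELD) + (1 | DepthID),
                    data = dat1)

# Back-transform: slope on the original scale = standardized slope / sd(x);
# the intercept absorbs the centering.
b_z    <- lme4::fixef(fm1_z)
b_orig <- b_z[-1] / sds
A      <- b_z[1] - sum(b_orig * mns)
setNames(c(A, b_orig), c("A", "B", "C", "D", "E"))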

xilliam
  • Hi Hamish, Yes. Dropping predictors will change things. I was mainly trying to help you see a way forward. You could perhaps add the ```kelvin^-2``` term, with something like ```dat1$kelvin_centered_pwr2 <- dat1$kelvin_centered^-2``` and then including ```kelvin_centered_pwr2``` in the model. As for the log10(kelvin) term, I am not confident what to suggest. When you apply ```log10(dat1$kelvin_centered)``` you get ```NaN```s in the output. Perhaps a more-knowledgeable soul will swoop in. You could also ask a new question on the stats stack exchange site. – xilliam Feb 25 '20 at 11:15
  • Hi xilliam, thanks for your effort on this front, it's super appreciated. In response to your earlier question: I can drop predictors and use simpler models, however they are somewhat different to the 5-term model I am using (see https://www.aqion.de/site/121 if interested; it's in the bottom section). The five terms of the model are essentially immediately usable as other thermodynamic parameters. – Hamish Robertson Feb 25 '20 at 11:22
  • Thanks again Xilliam :) I'll check those things out – Hamish Robertson Feb 25 '20 at 11:23
  • Last thought, where in your ```fm1``` equation are the parameters A-E? Right now, ```fm1``` is coded to include 'kelvin' which I believe corresponds to the "T" variables in your equations. – xilliam Feb 25 '20 at 12:08

I think you are going about this the wrong way. It sounds like you are trying to estimate the parameters A, B, C, D and E in the Maier-Kelley equation. You can do this by using non-linear least squares rather than a linear mixed effects model.

Start by defining a function that replicates the formula:

MK_eq <- function(A, B, C, D, E, Temp)
{
  A + B * Temp + C / Temp + D * log10(Temp) + E / (Temp^2)
}

Now we use the nls function to get an estimate for A to E:

mod1 <- nls(Log_Ca_Mg ~ MK_eq(A, B, C, D, E, kelvin), 
            start = list(A = 1, B = 1, C = 1, D = 1, E = 2), data = dat1)

coef(mod1)
#>             A             B             C             D             E 
#>  4.802008e+03  6.538166e-01 -2.818917e+05 -1.717040e+03  1.755566e+07 

and we can create a "regression line" by getting a prediction for every value of Kelvin between, say, 275 and 400 in increments of 0.1:

new_data <- data.frame(kelvin = seq(275, 400, 0.1))
new_data$Log_Ca_Mg <- predict(mod1, newdata = new_data)

and we can demonstrate that this is a good approximation by plotting our prediction over the sample:

library(ggplot2)

ggplot(dat1, aes(x = kelvin, y = Log_Ca_Mg)) + 
  geom_point() + 
  geom_line(data = new_data, linetype = 2, colour = "red", size = 2)

[Plot: Dataset 1 points with the nls prediction overlaid as a red dashed line]

Note that for simplicity I have avoided discussion of the random effects. It is possible to do mixed-effects non-linear least squares using the nlme package, but it is more involved, and the discussion here describes how to do it in more detail than I can.
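
For what it's worth, here is a minimal sketch of the nlme version, with a random intercept on A for a single grouping factor (FIELD) only and the nls coefficients used as starting values. It is an illustration under those assumptions, not something I have run against the linked data:

library(nlme)

# Random intercept on A for FIELD only; the crossed FIELD/DepthID structure
# is harder to specify in nlme than in lme4.
mod_nlme <- nlme(Log_Ca_Mg ~ MK_eq(A, B, C, D, E, kelvin),
                 data   = dat1,
                 fixed  = A + B + C + D + E ~ 1,
                 random = A ~ 1 | FIELD,
                 start  = coef(mod1))

summary(mod_nlme)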

Allan Cameron
  • I don't think that *need to do this* is accurate. The equation provided is linear in the parameters and so can be estimated using standard linear methods. – user2957945 Mar 01 '20 at 23:00
  • @user2957945 fair enough. It does give a "correct" result however. I'll soften it to "can" – Allan Cameron Mar 01 '20 at 23:05
  • OP appears to have repeated measures in their data. Why do you suggest just ignoring this? – Roland Mar 02 '20 at 11:49
  • @Roland I don't suggest they ignore it. I specifically say that I have left them out for illustration. If I can find the time I'll try to produce a fully worked example. – Allan Cameron Mar 02 '20 at 12:33
  • I'd be surprised if a non-linear mixed-effects model converges where a linear mixed-effects model doesn't. Looking forward to the fully worked example. – Roland Mar 02 '20 at 12:45
  • @Roland you're probably right; on exploring the data there are 676 unique factor levels for DepthID and 410 for FIELD, with only 1175 data points in total. – Allan Cameron Mar 02 '20 at 12:49
  • @Roland perhaps this is what is causing the failure of convergence? When you remove the random effects terms and run the model as an `lm` it gives the same result (more or less) as my `nls` version. – Allan Cameron Mar 02 '20 at 13:07
  • Hi @Roland and co. Thanks for your comments and post in particular Allan. I'm working through these suggestions and others and will report back soon. – Hamish Robertson Mar 02 '20 at 23:21