Manual interaction plot linear regression in R

Question

I am trying to predict the mean abundance of animals sighted during different moon phases (factor), using log-transformed abundance data (better fit) and some other variables. The best model (lowest AIC) turned out to include the interaction of phase and survey duration and the cloud cover (both continuous):

LMoon<-lm(ln~Phase*Duration+Clouds, data=abund)

summary(LMoon)

Call:
lm(formula = ln ~ Phase * Duration + Clouds, data = abund)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.75416 -0.46311  0.09522  0.46591  1.85978 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.382031   0.876865   0.436 0.664125    
Phase2            2.130065   1.226305   1.737 0.085851 .  
Phase3            1.971060   1.818542   1.084 0.281351    
Phase4            0.608043   1.140122   0.533 0.595146    
Phase5            4.786674   1.151850   4.156 7.44e-05 ***
Phase6            0.958706   1.046831   0.916 0.362238    
Phase7            0.254711   3.425214   0.074 0.940888    
Phase8            0.865995   1.043916   0.830 0.409005    
Duration          0.069153   0.035407   1.953 0.053952 .  
Clouds           -0.004259   0.002401  -1.774 0.079494 .  
Phase2:Duration  -0.087843   0.047818  -1.837 0.069545 .  
Phase3:Duration  -0.089908   0.069652  -1.291 0.200109    
Phase4:Duration  -0.005424   0.046675  -0.116 0.907749    
Phase5:Duration  -0.172016   0.049369  -3.484 0.000768 ***
Phase6:Duration  -0.035597   0.041435  -0.859 0.392583    
Phase7:Duration   0.024084   0.176773   0.136 0.891939    
Phase8:Duration  -0.033424   0.042064  -0.795 0.428963    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7721 on 89 degrees of freedom
Multiple R-squared:  0.3368,    Adjusted R-squared:  0.2176 
F-statistic: 2.825 on 16 and 89 DF,  p-value: 0.0009894

Now, due to this interaction, I need to make an interaction plot (CI are too wide when plotting the lsmeans). I've tried to use different functions mentioned out there but none of them worked. Apparently I need to calculate and plot manually which I did like this:

intercepts <- c(coef(LMoon)["(Intercept)"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase2"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase3"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase4"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase5"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase6"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase7"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase8"])

lines.df <- data.frame(intercepts = intercepts,
                       slopes = c(coef(LMoon)["Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase2:Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase3:Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase4:Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase5:Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase6:Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase7:Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase8:Duration"]),
                       Phase2 = levels(abund$Phase))

qplot(x = Duration, y = Sp2, color = Phase, data = abund) + 
  geom_abline(aes(intercept = intercepts, 
                  slope = slopes, 
                  color = Phase), data = lines.df)

The plot that I get is wrong as the y-values are on the original, true scale, but the lines are based on the lm which uses the log-transformed data.

interaction plot abundance, duration, lunar phases

To back-transform this someone told me that I won't get straight lines in the end, actually. Rather than using abline(), I should create a set of e.g. 100 new x values that cover the range of the duration data and use the coefficients to calculate your predicted y values. Then plot these using lines() and it should look like a smooth curve.

And this is where I get lost.

So I created a the set of new x-values for the range of the survey duration (min 15 max 45 min): dur2 <- seq(from = 15, to = 45, length.out=100)

Then once I got those values I am supposed to get the predicted y value for each x-value using the coefficients of my LM. After that, back-transforming the y-values to the original scale. And then using the x-values and the back-transformed y-values to add the lines to the plot.

How do I get the predicted values now exactly? I cannot use any pred type/function, I've tried it all. It just doesn't work with my model, so manual is the only way, but I have no clue how...

Hope anyone can help me with this, I've been trying for weeks and in despair by now, close to giving up.

Cheers!

PS Here the data:

> dput(subset(abund, Phase %in% c("Phase1", "Phase2")))

structure(list(Year = integer(0), Date = structure(integer(0), .Label = c("01/08/2009", 
"01/08/2016", "02/07/2019", "02/08/2009", "02/08/2012", "02/08/2016", 
"02/09/2007", "03/08/2007", "03/08/2009", "03/08/2014", "03/08/2015", 
"04/07/2019", "04/08/2009", "04/08/2013", "05/08/2009", "05/08/2014", 
"05/08/2015", "06/07/2008", "06/07/2019", "07/08/2009", "08/07/2010", 
"09/07/2010", "09/08/2015", "10/08/2009", "11/08/2009", "12/08/2009", 
"13/08/2009", "13/08/2014", "14/08/2009", "14/08/2012", "16/07/2006", 
"18/07/2009", "18/08/2015", "19/07/2011", "20/08/2009", "21/07/2011", 
"21/09/2009", "22/07/2011", "22/07/2016", "22/07/2017", "23/07/2007", 
"23/07/2016", "23/07/2017", "24/07/2017", "25/07/2007", "25/07/2010", 
"25/07/2017", "25/08/2016", "26/07/2010", "26/07/2011", "27/07/2006", 
"27/07/2011", "27/07/2012", "28/07/2016", "29/06/2019", "29/07/2005", 
"29/07/2009", "29/07/2010", "29/07/2016", "29/07/2019", "30/07/2005", 
"30/07/2007", "30/07/2016", "30/08/2005", "31/07/2005", "31/07/2009", 
"31/07/2014", "31/07/2016"), class = "factor"), NrSurvey = integer(0), 
    Duration = integer(0), Sp2 = integer(0), Phase = structure(integer(0), .Label = c("1", 
    "2", "3", "4", "5", "6", "7", "8"), class = "factor"), Clouds = integer(0), 
    Visibility = integer(0), ln = numeric(0)), row.names = integer(0), class = "data.frame")

tl;dr but see: https://www.nature.com/articles/npre.2010.4136.1 for why you shouldn't be log-transforming count data — Dylan_Gomes, Oct 14 '20 at 19:07
@Dylan_Gomes that's true but I don't have zeros in my data set, as I only focus on positive surveys i.e. where animals have been sighted. I just want to see if the number of animals differs between moon phases. For presence absence I am using different models — Cathrin, Oct 16 '20 at 09:43
@RussLenth could you specify? I've tried the emmeans before, however CI are too wide then due to the interaction — Cathrin, Oct 16 '20 at 09:43
The CI widths are based on the standard errors and the adjustment method. If you are saying that you are creating a bunch of CIs that you don't need, thereby unnecessarily increasing the size of the family that you are adjusting for, then chances are you need to use a `by` variable. — Russ Lenth, Oct 16 '20 at 14:25
@Cathrin, getting rid of zeros is also losing information, since absence is information. It might be worth looking at a poisson or negative binomial model, instead of what you are doing. But that is just my opinion. — Dylan_Gomes, Oct 16 '20 at 15:59
I agree completely on the zeros. A famous and dramatic example is the Challenger disaster. They didn't look at the cases where no O-ring failures occurred. If they had, they would have scrubbed the launch. The poisson or NB models are good suggestions. Or even just use log(y + 1) -- very common. — Russ Lenth, Oct 18 '20 at 02:14

score 2 · Accepted Answer · answered Oct 14 '20 at 18:48

2

Use predict to get predicted values. Don't calculate manually.

Use expand.grid() to generate a data frame of all combinations of your dur2 sequence and the other predictors at the value(s) you want plot. Something like this:

prediction_data = expand.grid(
  Duration = dur2,
  Phase= unique(abund$Phase),
  Clouds = mean(abund$Clouds) # Hold Clouds constant at some value
)

# column names in prediction_data need to match those in the model formula
prediction_data$pred = predict(LMoon, newdata = prediction_data)
prediction_data$pred_orig = exp(prediction_data$pred)

# plot
ggplot(prediction_data, aes(x = Duration, y = pred_orig, color = Phase)) +
  geom_line() +
  geom_point(data = abund)

Something like that should work.

Another nice option is to use broom::augment to generate the predictions. This can easily give standard errors and residuals for each prediction point as well.

library(broom)
prediction_data = augment(LMoon, newdata = prediction_data)

answered Oct 14 '20 at 18:48

Gregor Thomas

136,190
20
167
294

1

Make sure the `prediction_data` has a column named `Phase` and we should be fine. If you post a reproducible sample of data (maybe `dput(subset(abund, Phase %in% c("Phase1", "Phase2")))`, or perhaps a smaller subset, then we can actually run the code on something that looks like your data which is very helpful for debugging. – Gregor Thomas Oct 14 '20 at 19:27
Thanks so much, it worked!! -- Except the plotting, it keeps giving me Error in ``FUN(X[[i]], ...) : object 'pred_orig' not found``. However, when running just the line ggplot(prediction_data, aes(x = Duration, y = pred_orig)) without geom_line/point() the error does not occur. Any ideas? – Cathrin Oct 14 '20 at 21:08
Also noticed that the plot is empty. It is not when using the base=10 in log() before running the model etc, then the plot gives me some dots and lines at least, but still not with the geom_line/point() – Cathrin Oct 14 '20 at 21:22
You'll need to create the `pred_orig` column as in my answer. No idea about the empty plot, but if you share some a little sample data I could try to take a look. See my previous comment for how to share sample data. – Gregor Thomas Oct 15 '20 at 01:14
Hi agian, I am now having other issues with an interaction plot -- with binary presence/absence data. The plot connects each point between upper and lower limit, any idea why that is? I used your suggestion from above but that was for count data. Working with glm() – Cathrin Mar 04 '21 at 08:22
With that little information? Not a clue. I'd suggest asking a new question and providing a minimal reproducible example. Half the time the process of creating a minimal reproducible example will lead to you solving the problem yourself, and the other half of the time it makes it quick for others to help you. – Gregor Thomas Mar 05 '21 at 16:06
Yeh I did that and figured it out today! – Cathrin Mar 06 '21 at 17:32

Manual interaction plot linear regression in R

1 Answers1

Linked