3

Question moved to CrossValidated


I am trying to express the difference in the 'speed of increase' between two categories in gam modelling. My data represents the cumulative values over time [0-100%], but I wish (for comparability with other studies) to express them in yearly values. I wonder if it is even meaningfull, if my models are gam, and therefore non-linear?

Previously, to calculated the 'annual values' per two groups by the Compound annual growth rate (CARG) from the starting and ending values/number of years from the gam predicted values here. Now, I wonder if is it somehow possible get those value from the model itselft, and converted to annual increase? Something, like 'Betas' from 'glm' to show that one predictor is more important then the other?

My data looks like this: I have two groups, with the cumulative rate of forest loss over time (ranging between 0-100% from no forest loss to complete forest loss). Eg. group yellow has predicted lost ~ 40 % of forest from 2000 to 2017, group blue has predicted lost ~ 10 % over the same time. My dummy data are available here and in dput form below. My pseudo-code model: forest_loss ~ year + location, where different groups represent different location.

# Restructure the data first
  dd$location = as.factor(dd$location)  # claim grouping variable as factor
  dd$forest_loss <- dd$forest_loss/100  # the the values between 0-1

# Plot the gam model
   ggplot(dd, aes(x = year, 
                y = forest_loss, 
                group = location,
               color = location)) +
  geom_smooth(method = 'gam', 
              method.args = list(family = "betar"),
              formula = y~s(x, bs = 'cs'))  + 
  geom_point()

My dummy model is in a formula similar to this: m1<-bam(y ~ s(x, by = grp) + grp, dd, family = betar) (I am using the form as used for the real data, althought I have more predictors there), based on bam/gam. Family betar is because by predicted values ranges between 0-100 (0-1).

This is the dummy example so some part of teh code (gam) can produce errors. But my main question is how to express the annual increase per group from the cumulative values themselves? Is it already accessbible from summary(m1), and it is meaningfull?

Thank you for your thoughts.

enter image description here

Dummy data (shared as suggested here using the dput)

   dd<- structure(list(year = c(2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 
2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 
2015L, 2016L, 2017L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 
2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 
2015L, 2016L, 2017L), forest_loss = c(2, 2.5, 2.7, 2.9, 2.9, 3, 3.1, 3.1, 
3.1, 4, 5, 5.6, 5.8, 5.9, 6.7, 7.2, 9, 10, 3, 3.2, 3.4, 3.5, 
3.5, 7, 7.2, 8, 15, 19, 22, 25, 27, 29, 32, 33, 35, 40), location = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L), .Label = c("blue", "yellow"), class = "factor")), row.names = c(NA, 
-36L), class = "data.frame")

Answer to comments:

somehow, the tidymv::plot_smooths(m1) provides an error:

Error in `parse_expr()`:
! `x` must contain exactly 1 expression, not 0.
Run `rlang::last_error()` to see where the error occurred.

Output from the summary(m1)

Family: Beta regression(75.174) 
Link function: logit 

Formula:
y ~ s(x, by = grp)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -2.671      0.083  -32.19   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                 edf Ref.df Chi.sq  p-value    
s(x):grpblue   1.000  1.000  11.64 0.000647 ***
s(x):grpyellow 3.684  4.547 283.12  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.952   Deviance explained =   85%
-REML = -64.262  Scale est. = 1         n = 36

Partial effect by gratia: gratia::draw(m1, scales = 'free')

enter image description here

maycca
  • 3,848
  • 5
  • 36
  • 67
  • 1
    This question might be better suited in [CrossValidated](https://stats.stackexchange.com/) – Maël Sep 05 '22 at 14:54
  • 1
    Some observations: 1) The plot you show is a loess curve (what geom_smooth uses by default), not a generalized additive model. 2) You use family = betar in your bam model when your data is not bounded between 0 and 1. 3) Your model on the example data does not fit the data properly (try plotting m1 with tidymv::plot_smooths, trained on your dummy data). 4) I think what you are after is simply summary(m1). – Ventrilocus Sep 06 '22 at 06:02
  • @Ventrilocus re 2) I suspect they are plotting on a % scale (given how they describe the problem in terms of %) - I doubt the model would even fit if provided data like this as Simon's `betar` family would set all the response data to some value just a little less than 1 – Gavin Simpson Sep 06 '22 at 08:35
  • Thank you all ofr your thoughts. I have tried to update my question by answering to each comment. Still, somehow I even can properly fit the model itself: `m1~gam(y ~ s(x, by = grp), data = dd, method = 'REML', family = betar)` as I am geting error ` inner loop 3; can't correct step size`, which I do not understand? My preudocode should looks like this: `forest_loss~year + location`. Thank you again ofr your thoughts! – maycca Sep 06 '22 at 14:18

1 Answers1

2

If you want to use plot_smooth from tidymv you should specify the series and comparison argument like this:

library(mgcv)
library(tidymv)
dd$location = as.factor(dd$location)  # claim grouping variable as factor
dd$forest_loss <- dd$forest_loss/100  # the the values between 0-1

m1<-bam(forest_loss ~ s(year, by = location) + location, dd, family = betar)
tidymv::plot_smooths(m1, year, location)

Created on 2022-09-09 with reprex v2.0.2

Quinten
  • 35,235
  • 5
  • 20
  • 53