
The story:

I'm facing a problem with a GAM in which only two input variables should be considered:

x = relative price (%) the customer paid for the product, relative to the entry price for the club

b = binary: whether the customer has to pay for the product (VIPs get it for free)

The output variable is

y = whether the customer took the product

and this simulates the data:

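library(mgcv)
library(data.table)
set.seed(2017)
y <- sample(c(0, 1), 100, replace = TRUE)  # took the product (1) or not (0)
x <- rgamma(100, 3, 3)                      # relative price
b <- as.factor(ifelse(x < .5, 0, 1))        # 0 = VIP (free), 1 = has to pay
dat <- as.data.table(list(y = y, x = x, b = b))
dat[b == "0", x := 0]                       # VIPs pay nothing
plot(dat$x, dat$y, col = dat$b)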

(Figure: relative price)

As you can see in the plot, customers who didn't have to pay for the product have a relative price of 0%; the others have relative prices between 0.5% and 3.5%.

here comes the problem:

I want to model a dummy effect for b and a smooth effect for x (of course only for those who have to pay), so I also use b as a by-variable in s(x):

mod <- bam(y~b+s(x, by=b), data=dat, family=binomial(link="logit"))
summary(mod)
par(mfrow=c(1,2))
plot(mod)

(Figure: smooth effects)

My questions are:

a. Why can you still see a rug at 0% for s(x, b=1)? Wouldn't it make more sense if mgcv only considered those who have to pay? Does this problem have something to do with the knots?

b. As you can see in the summary, the dummy effect is estimated as NA. This might be because the information in b was entirely used up as the by-variable in s(x), so the dummy b itself has no more information to give? How can I overcome this problem? In other words: is there an option to model a smooth term for only a subset of the data and make mgcv actually use only this subset to fit it?

97m423
  • So run the smoother using the subset argument to gam, then store its predicted value (from `predict`) in the data; then use that as a predictor in your next call to `gam`, and include the dummy too. –  Mar 21 '17 at 06:21
  • @dash2 that's an interesting idea, thanks. I understand the subset fitting, but how do you legitimately carry this back into the whole dataset? Let's say there are other covariates besides x: do you also include these in the subset fit, and what do you do if they are estimated very differently on the whole dataset? – 97m423 Mar 22 '17 at 13:54
  • @李哲源ZheyuanLi since I have fewer than 15 reputation points, I cannot vote :* – 97m423 Mar 22 '17 at 13:56
  • @97m423 I think to use the prediction for the whole dataset you'd implicitly set the prediction value to 0 for values outside the subset. For example you could do something like `b %in% mysubset * s(x,by=b)`. –  Mar 22 '17 at 22:58

1 Answer


Your question is conceptually the same as "How can I force dropping intercept or equivalent in this linear model?". You want to apply a contrast to b rather than use all of its levels.

In a GAM setting, you want:

dat$B <- as.numeric(dat$b) - 1  # 0 for the level to omit, 1 otherwise
mod <- gam(y ~ b + s(x, by = B), data = dat, family = binomial(link = "logit"))

For a factor by smooth, mgcv does not apply a contrast to by if the factor is unordered. This is generally appealing, as we often want a smooth for each factor level. It is thus your responsibility to use some trick to get what you want. What I did above is coerce the two-level factor b to a numeric B, with the level you want to omit being numerically 0, and then use the numeric 'by' variable B: the smooth is multiplied by B, so it contributes nothing where B is 0. This idea cannot be extended to factors with more levels.
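
As a rough check of the numeric 'by' trick, the sketch below re-creates the question's simulated data (using a plain data.frame instead of data.table, which is an assumption made only to keep the example short) and inspects the smooth's columns of the linear-predictor matrix. The object names `Xp`, `smooth_cols` and `max_abs_nonpayer` are made up for this illustration. It only looks at the design of the smooth, not at how the coefficient of b is estimated.

```r
library(mgcv)

# Re-create the question's simulated data (same seed and draw order)
set.seed(2017)
y <- sample(c(0, 1), 100, replace = TRUE)
x <- rgamma(100, 3, 3)
b <- factor(ifelse(x < 0.5, 0, 1))
x[b == "0"] <- 0                       # non-payers get a relative price of 0
dat <- data.frame(y = y, x = x, b = b)

# Numeric 'by': 0 for the level to omit (non-payers), 1 for payers
dat$B <- as.numeric(dat$b) - 1
mod <- gam(y ~ b + s(x, by = B), data = dat, family = binomial(link = "logit"))

# Columns of the linear-predictor matrix that belong to the smooth
Xp <- predict(mod, type = "lpmatrix")
smooth_cols <- grep("s(x)", colnames(Xp), fixed = TRUE)

# Largest absolute entry of those columns among non-payers
max_abs_nonpayer <- max(abs(Xp[dat$B == 0, smooth_cols]))
max_abs_nonpayer
```

If the smooth's rows are all zero for non-payers, the rug mark at 0 in plot(mod) is only a display matter: those observations do not enter the smooth's part of the linear predictor.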


If your factor by has more than 2 levels and you still want to enforce a contrast, you need to use an ordered factor; mgcv then fits no smooth for the first level. For example, you can do

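dat$B <- ordered(dat$b)  # ordered factor: no smooth is generated for the first level
mod <- gam(y ~ b + s(x, by = B), data = dat, family = binomial(link = "logit"))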

Read more about 'by' variables in ?gam.models.
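
To see which smooths mgcv actually builds for an ordered 'by' factor, here is a minimal sketch on the question's simulated data (again with a plain data.frame, an assumption for brevity). The names `mod_ord` and `smooth_labels` are hypothetical and used only here.

```r
library(mgcv)

# Re-create the question's simulated data (same seed and draw order)
set.seed(2017)
y <- sample(c(0, 1), 100, replace = TRUE)
x <- rgamma(100, 3, 3)
b <- factor(ifelse(x < 0.5, 0, 1))
x[b == "0"] <- 0                       # non-payers get a relative price of 0
dat <- data.frame(y = y, x = x, b = b)

# Ordered 'by' factor: the first level ("0") acts as the reference level
dat$B <- ordered(dat$b)
mod_ord <- gam(y ~ b + s(x, by = B), data = dat,
               family = binomial(link = "logit"))

# Labels of the smooth terms stored in the fitted model
smooth_labels <- vapply(mod_ord$smooth, function(s) s$label, character(1))
smooth_labels
```

With a two-level ordered factor this should list a single smooth, the one for the payers, alongside the parametric dummy for b.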

Zheyuan Li
  • Thanks for your answer Zheyuan, but with both of your suggestions gam still smooths x over the whole range, as you can see in plot(mod): the very first rug mark appears at 0, which just makes no sense. – 97m423 Mar 22 '17 at 13:49
  • @97m423 But you don't need to worry. Note that at 0, the by value is also 0, so it has no effect on the prediction! A spline is just a piecewise polynomial. It is legitimate to extend its range even to -Inf and Inf. What mgcv's plot.gam does is plot over the range of your covariate values. You can always produce the plot over whatever range you want, i.e., 0.5 to 3.5 in your example. – Zheyuan Li Mar 22 '17 at 13:50
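
One possible way to draw the smooth only over 0.5 to 3.5, as the last comment suggests, is to predict the term on a grid and plot it by hand. This is a sketch, not the answerer's own code: it assumes the ordered-factor model from the answer, a plain data.frame instead of data.table, and the hypothetical names `grid`, `pr`, `sm_col`, `fit` and `se`.

```r
library(mgcv)

# Re-create the question's simulated data (same seed and draw order)
set.seed(2017)
y <- sample(c(0, 1), 100, replace = TRUE)
x <- rgamma(100, 3, 3)
b <- factor(ifelse(x < 0.5, 0, 1))
x[b == "0"] <- 0                       # non-payers get a relative price of 0
dat <- data.frame(y = y, x = x, b = b)
dat$B <- ordered(dat$b)
mod_ord <- gam(y ~ b + s(x, by = B), data = dat,
               family = binomial(link = "logit"))

# Grid over the range of interest, for payers only
grid <- data.frame(
  x = seq(0.5, 3.5, length.out = 200),
  b = factor("1", levels = levels(dat$b)),
  B = ordered("1", levels = levels(dat$B))
)

# Term-wise predictions on the link scale, with standard errors
pr <- predict(mod_ord, newdata = grid, type = "terms", se.fit = TRUE)
sm_col <- grep("s(x)", colnames(pr$fit), fixed = TRUE)
fit <- pr$fit[, sm_col]
se  <- pr$se.fit[, sm_col]

# Smooth with an approximate +/- 2 standard error band
plot(grid$x, fit, type = "l",
     ylim = range(fit - 2 * se, fit + 2 * se),
     xlab = "relative price (%)", ylab = "s(x) for payers")
lines(grid$x, fit + 2 * se, lty = 2)
lines(grid$x, fit - 2 * se, lty = 2)
```

The grid may reach slightly beyond the largest observed x, in which case the curve there is an extrapolation and the band widens accordingly.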