cut is your friend here: it returns a factor, which R automatically expands into dummy variables for the 10 bins when used in a model (the first bin serves as the baseline and is absorbed into the intercept).
set.seed(2932)
x = runif(1e4)
y = 3 + 4 * x + rnorm(1e4)
x_cut = cut(x, 0:10/10, include.lowest = TRUE)
summary(lm(y ~ x_cut))
# Call:
# lm(formula = y ~ x_cut)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -3.7394 -0.6888  0.0028  0.6864  3.6742
#
# Coefficients:
#                Estimate Std. Error t value Pr(>|t|)
# (Intercept)     3.16385    0.03243  97.564   <2e-16 ***
# x_cut(0.1,0.2]  0.43932    0.04551   9.654   <2e-16 ***
# x_cut(0.2,0.3]  0.85555    0.04519  18.933   <2e-16 ***
# x_cut(0.3,0.4]  1.26441    0.04588  27.556   <2e-16 ***
# x_cut(0.4,0.5]  1.66181    0.04495  36.970   <2e-16 ***
# x_cut(0.5,0.6]  2.04538    0.04574  44.714   <2e-16 ***
# x_cut(0.6,0.7]  2.44771    0.04533  53.999   <2e-16 ***
# x_cut(0.7,0.8]  2.80875    0.04591  61.182   <2e-16 ***
# x_cut(0.8,0.9]  3.22323    0.04545  70.919   <2e-16 ***
# x_cut(0.9,1]    3.60092    0.04564  78.897   <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 1.011 on 9990 degrees of freedom
# Multiple R-squared: 0.5589, Adjusted R-squared: 0.5585
# F-statistic: 1407 on 9 and 9990 DF, p-value: < 2.2e-16
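To see the expansion directly, you can inspect the design matrix that lm builds from the factor. This is only a small sketch on made-up toy values (x_small, f and mm are illustrative names, not part of the example above):

```r
# Toy values chosen for illustration; they fall in bins 1, 2, 3 and 10
x_small = c(0.05, 0.15, 0.25, 0.95)
f = cut(x_small, 0:10/10, include.lowest = TRUE)

# cut keeps all 10 levels, even those with no observations
nlevels(f)

# model.matrix shows the columns a model formula would generate:
# an intercept plus one dummy per non-baseline level
mm = model.matrix(~ f)
dim(mm)
colnames(mm)[1:3]
```

The first level, [0,0.1], has no column of its own; it is the baseline that the intercept represents, which is why the summary above shows 9 x_cut coefficients rather than 10.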
See ?cut for more customizations.
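As a rough sketch of a few of those options (right, labels), again on made-up toy values; the variable names here are just for illustration:

```r
# Toy values for illustration, including points that sit on a break
x_small = c(0, 0.1, 0.25, 0.99)

# right = FALSE makes intervals closed on the left: [a, b)
left_closed = cut(x_small, 0:10/10, right = FALSE, include.lowest = TRUE)

# labels = FALSE returns integer bin codes instead of a factor
codes = cut(x_small, 0:10/10, include.lowest = TRUE, labels = FALSE)

# custom labels, here with only two bins split at 0.5
named = cut(x_small, c(0, 0.5, 1), include.lowest = TRUE,
            labels = c("low", "high"))

left_closed
codes
named
```

Note how 0.1 lands in a different bin depending on right: with the default right = TRUE it belongs to the first bin, with right = FALSE to the second.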
You can also pass cut directly in the RHS of the formula, which makes using predict a bit easier, since the same binning is then applied to new data automatically:
reg = lm(y ~ cut(x, 0:10/10, include.lowest = TRUE))
idx = sample(length(x), 500)
plot(x[idx], y[idx])
x_grid = seq(0, 1, length.out = 500L)
lines(x_grid, predict(reg, data.frame(x = x_grid)),
      col = 'red', lwd = 3L, type = 's')
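If you only want predictions at a few points rather than a plot, something along these lines should work. It refits the same simulated model so the snippet runs on its own; new_x and p are illustrative names:

```r
# Same simulated data and model as above, repeated so this runs standalone
set.seed(2932)
x = runif(1e4)
y = 3 + 4 * x + rnorm(1e4)
reg = lm(y ~ cut(x, 0:10/10, include.lowest = TRUE))

# newdata only needs a column named x; the cut() call stored in the
# formula is re-applied to it by predict
new_x = data.frame(x = c(0.11, 0.19, 0.55))
p = predict(reg, new_x)
p
```

The first two points fall in the same bin, (0.1,0.2], so they get the same fitted value; that is the step-function behaviour the red line in the plot shows.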
