
I have a variable x that is between 0 and 1, or (0,1]. I want to generate 10 dummy variables for 10 deciles of variable x. For example x_0_10 takes value 1 if x is between 0 and 0.1, x_10_20 takes value 1 if x is between 0.1 and 0.2, ...

The Stata code to do the above is something like this:

forval p=0(10)90 {
    local Next=`p'+10
    gen x_`p'_`Next'=0
    replace x_`p'_`Next'=1 if x<=`Next'/100 & x>`p'/100
}

I am new to R and wonder how I can do the above in R.

D_B
  • "between" isn't precise enough. The notation (a, b) for numbers > a and < b, that of [a, b] for numbers >= a and <= b, that of (a, b] for numbers > a and <= b, etc. would help you and others here. See e.g. https://en.wikipedia.org/wiki/Interval_(mathematics) (Unfortunately SO, unlike CV, doesn't support civilised mark-up here, or so I understand.) – Nick Cox Feb 23 '20 at 11:35

2 Answers


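cut is your friend here; its output is a factor, which R will auto-expand into dummy variables when it is used in a model. With an intercept, nine indicators are created and the first bin serves as the reference level, as the output below shows.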

set.seed(2932)

x = runif(1e4)
y = 3 + 4 * x + rnorm(1e4)

x_cut = cut(x, 0:10/10, include.lowest = TRUE)

summary(lm(y ~ x_cut))
# Call:
# lm(formula = y ~ x_cut)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -3.7394 -0.6888  0.0028  0.6864  3.6742 
# 
# Coefficients:
#                Estimate Std. Error t value Pr(>|t|)    
# (Intercept)     3.16385    0.03243  97.564   <2e-16 ***
# x_cut(0.1,0.2]  0.43932    0.04551   9.654   <2e-16 ***
# x_cut(0.2,0.3]  0.85555    0.04519  18.933   <2e-16 ***
# x_cut(0.3,0.4]  1.26441    0.04588  27.556   <2e-16 ***
# x_cut(0.4,0.5]  1.66181    0.04495  36.970   <2e-16 ***
# x_cut(0.5,0.6]  2.04538    0.04574  44.714   <2e-16 ***
# x_cut(0.6,0.7]  2.44771    0.04533  53.999   <2e-16 ***
# x_cut(0.7,0.8]  2.80875    0.04591  61.182   <2e-16 ***
# x_cut(0.8,0.9]  3.22323    0.04545  70.919   <2e-16 ***
# x_cut(0.9,1]    3.60092    0.04564  78.897   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 1.011 on 9990 degrees of freedom
# Multiple R-squared:  0.5589,  Adjusted R-squared:  0.5585 
# F-statistic:  1407 on 9 and 9990 DF,  p-value: < 2.2e-16

See ?cut for more customizations
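
If explicit 0/1 columns are wanted, as in the Stata loop from the question, one possible approach is to expand the factor with model.matrix. This is only a sketch: the column names (x_0_10 and so on) and the data frame name dummies are chosen here for illustration to mirror the question's naming, and include.lowest = TRUE means an exact zero would land in the first bin, unlike in the Stata code.

```r
set.seed(2932)
x = runif(1e4)

# Bin x into ten intervals: [0,0.1], (0.1,0.2], ..., (0.9,1]
x_cut = cut(x, 0:10/10, include.lowest = TRUE)

# "- 1" drops the intercept so that all ten levels get their own 0/1 column
dummies = as.data.frame(model.matrix(~ x_cut - 1))

# Illustrative names mirroring the question: x_0_10, x_10_20, ..., x_90_100
lower = seq(0, 90, by = 10)
names(dummies) = paste0("x_", lower, "_", lower + 10)

head(dummies)
```

Each row has exactly one column equal to 1, so the columns can be bound to an existing data frame with cbind if needed.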

You can also pass cut directly in the RHS of the formula, which would make using predict a bit easier:

reg = lm(y ~ cut(x, 0:10/10, include.lowest = TRUE))
idx = sample(length(x), 500)
plot(x[idx], y[idx])

x_grid = seq(0, 1, length.out = 500L)
lines(x_grid, predict(reg, data.frame(x = x_grid)), 
      col = 'red', lwd = 3L, type = 's')

[Figure: plot with fit, showing the sampled (x, y) points and the fitted step function in red]

MichaelChirico

This won't fit well into a comment, but for the record, the Stata code can be simplified down to

forval p = 0/9 {
    gen x_`p' = x > `p'/10 & x <= (`p' + 1)/10
}

Note that, contrary to the OP's claim, values of x that are exactly zero will be mapped to zero for all these variables, in both their code and mine (which is intended to be a simplification of their code, not a correct way to do it, modulo a difference of taste on variable names). That follows from the fact that 0 is not greater than 0. Likewise, values that are exactly 0.1, 0.2, 0.3, ... will in principle go in the lower bin, not the higher bin, but that is complicated by the fact that most multiples of 0.1 don't have exact binary representations (0.5 is clearly an exception).
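
The same two boundary points can be checked from R. The following is a small sketch using base R only; it is meant to illustrate the behaviour, not to prescribe a binning rule, and the name breaks is just a convenience.

```r
breaks = 0:10/10

# With half-open bins (a, b], an exact zero belongs to no bin ...
cut(0, breaks)                          # NA
# ... unless the lowest bin is explicitly closed on the left
cut(0, breaks, include.lowest = TRUE)   # [0,0.1]

# Most multiples of 0.1 are only approximated in binary floating point
sprintf("%.20f", 0.1)
0.1 + 0.2 == 0.3                        # FALSE
# 0.5 is a power of two and is represented exactly
0.25 + 0.25 == 0.5                      # TRUE
```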

Indeed, depending on details about their set-up that the OP doesn't tell us, indicator variables (dummy variables, in their terminology) may well be available in Stata without a loop or made quite unnecessary by factor variable notation. In that respect Stata is closer to R than may at first appear.

While not answering the question directly, the signal here to Stata and R users alike is that Stata need not be so awkward as might be inferred from the code in the question.

Nick Cox