Recoding several variables at once
I would like to recode all "999" values in the variables to missing. In Stata, I can do
forvalue i = 1(1)3{
replace var`i' ="NA" if var`i' =="999"
}
(For completeness) You could also recode several variables at once using lapply().
The lapply() function takes a set of variables and applies a function, e.g. ifelse(), to each of them. You tell it the dataset and the variables using [] subsetting, e.g. data[, variables].
Then you define what you want to do with each variable; this can be anything that takes a variable as input, such as recoding.
The function starts by defining something similar to the "i" local in the Stata loop: in function(var), var plays a role similar to the i.
Finally, you need to say where the result of lapply() goes, i.e. to new or recoded variables, again using data[, variables].
Here is an example:
# Example data
data <- data.frame(
var1 = c( 1,2,999),
var2 = c(1,999,2),
var3 = c(1,3,999)
)
# Object with the names of the variables you would like to recode.
vars_to_recode <- c("var1","var2","var3")
# Recoding
data[ ,vars_to_recode] <- lapply(data[ ,vars_to_recode],
function(var)
ifelse(var == 999, NA, var)
)
data
# var1 var2 var3
# 1 1 1 1
# 2 2 NA 3
# 3 NA 2 NA
What this does is actually closer to Stata's replace, in that the original variables are replaced with transformed versions of themselves.
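For readers coming from Stata, the same recoding can also be written as an explicit loop, which maps more directly onto the forvalues loop in the question. This is only a sketch of an equivalent approach, not a replacement for the lapply() version; the object name data_loop is made up for this illustration.

```r
# Same example data as above, under a hypothetical name so the original is untouched
data_loop <- data.frame(
  var1 = c(1, 2, 999),
  var2 = c(1, 999, 2),
  var3 = c(1, 3, 999)
)

# Loop over the column names; v plays the role of the Stata local i
for (v in c("var1", "var2", "var3")) {
  # Logical indexing: set the entries equal to 999 to NA, leave the rest alone
  data_loop[[v]][data_loop[[v]] == 999] <- NA
}

data_loop
```

The loop changes the columns in place one at a time, whereas lapply() builds all the recoded columns first and assigns them back in a single step.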
An alternative to lapply() is map() from the purrr package, but particularly for programming I (currently) prefer the base R function.
New variables containing the mean of old variables
A second part of the question that can also be answered using lapply() is how to create variables containing the means of others. From the original question:
Also, if I have column named ht, wgt, bmi, I would like to calculate the mean of the column and store the mean in new column with respective name.
In Stata, I can do
foreach i of varlist ht wgt bmi{
gen `i'mean = mean(`i')
}
The solution using lapply() simply calculates the mean and puts it into a new variable/column. This works because R automatically repeats a shorter column ("vector") to fill the length of the data frame (called "recycling").
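To see the recycling in isolation, here is a small sketch with a made-up data frame (toy and its column names are hypothetical and not part of the question's data): a single value assigned to a new column is repeated for every row.

```r
# Hypothetical three-row data frame, only for illustrating recycling
toy <- data.frame(x = c(2, 4, 9))

# mean(toy$x) is a single number; R repeats it to fill all three rows
toy$x_mean <- mean(toy$x)

toy
#   x x_mean
# 1 2      5
# 2 4      5
# 3 9      5
```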
Example data
df <- data.frame(
id = 1:2,
ht = c(154.5,164.2),
wgt = c(43.1 ,63),
bmi = c(18.1 ,23.4))
Define variables you want to change and names for new variables.
vars <- names(df[,2:4])
# Names for new variables
newvars <- paste(vars, "_mean")  # paste() separates its arguments with a space by default
newvars
# [1] "ht _mean" "wgt _mean" "bmi _mean"
Generate new variables containing the means of the variables of interest:
df[,newvars] <- lapply(df[,vars],
function(var)
mean(var)
)
Result:
df
#   id    ht  wgt  bmi ht _mean wgt _mean bmi _mean
# 1  1 154.5 43.1 18.1   159.35     53.05     20.75
# 2  2 164.2 63.0 23.4   159.35     53.05     20.75
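Because paste() separates its arguments with a space by default, the new columns above have names like "ht _mean", which must be wrapped in backticks when used with $. A possible variant, assuming you prefer names without spaces, builds them with paste0() instead (df2, vars2 and newvars2 are names made up for this sketch):

```r
# Same example data as above, under a hypothetical name
df2 <- data.frame(
  id  = 1:2,
  ht  = c(154.5, 164.2),
  wgt = c(43.1, 63),
  bmi = c(18.1, 23.4)
)

vars2 <- c("ht", "wgt", "bmi")

# paste0() concatenates without a separator: "ht_mean", "wgt_mean", "bmi_mean"
newvars2 <- paste0(vars2, "_mean")

# Each mean is a single value and is recycled to the number of rows
df2[, newvars2] <- lapply(df2[, vars2], mean)

# The new columns can now be accessed without backticks
df2$ht_mean
```

Passing mean directly is equivalent to function(var) mean(var); the longer form is mainly useful when extra steps are needed inside the function.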