3

I'm new to R and am using Stata. I can loop through the variables using foreach and forvalue in Stata.

I would like to loop over column rather than loop over row. For example, I have column named var1, var2, var3. And the data as below:

var1 var2 var3
  1    1   1   
  2   999  3
 999   2  999

I would like to recode all "999" values in the variables to missing. In Stata, I can do

forvalue i = 1(1)3{
  replace var`i' ="NA" if var`i' =="999"
}

Therefore, I have the outcome like

var1 var2 var3
  1    1   1   
  2   NA   3
  NA   2  NA

Also, if I have column named ht, wgt, bmi, I would like to calculate the mean of the column and store the mean in new column with respective name. The dataset is as below:

 ht     wgt   bmi
154.5  43.1 18.1
164.2  63   23.4

In Stata, I can do

foreach i of varlist ht wgt bmi{
  gen `i'mean = mean(`i')
}

And the outcome will be

 ht    wgt   bmi  htmean wgtmean bmimean
154.5  43.1 18.1  159.35  53.05   20.75
164.2  63   23.4  159.35  53.05   20.75

I've no idea how to do using R.

Nick Cox
  • 35,529
  • 6
  • 31
  • 47
kmY
  • 89
  • 1
  • 10
  • You probably won't need for-loops to do either of those in R; the corresponding functions are most likely _vectorized_. – neilfws May 29 '19 at 01:42
  • @neilfws to me it looks like the Stata code is looping over columns, so the replacement might be something like `dplyr::mutate_at()`. – Marius May 29 '19 at 01:45
  • Thanks. Yes, @Marius, I would like to loop over column rather than row. I've tried to use `mutate_at() `. However, it give the comment `Evaluation error: replacement has 0 rows, data has 1492.` My code is this: `f1 <- function(x) dat$x[dat$x==999] <-NA dat <- mutate_at(dat,vars(var1:var21), f1)` – kmY May 29 '19 at 02:27
  • If you put in a small sample of your data set (see https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) we will be better able to help you. – Aaron left Stack Overflow May 29 '19 at 02:45
  • @Aaron Thanks. I've added simple samples in the question. – kmY May 29 '19 at 03:02
  • For the record, your last Stata example should use `egen` not `gen` (meaning `generate`). – Nick Cox May 29 '19 at 08:51

3 Answers3

4

There are lots of different ways to do these kinds of things. E.g. for the height, weight, BMI example, you can do this basically the same way you did in Stata, using a for loop:

# For-loop approach
for (col in c("ht", "wgt", "bmi")) {
    new_col = paste0(col, "_mean")
    df2[, new_col] = mean(df2[, col])
}

The difference is there's a stronger separation between symbols in code and strings in R, so you specify the column names as strings, use paste0 to create strings representing the new column names, and then add them to the dataframe.

Another way to do this is using the dplyr package and the mutate_at function, which will apply the same transformation to multiple columns:

library(dplyr)

df2 %>%
    mutate_at(c("ht", "wgt", "bmi"), 
              list(mean = ~ mean(.)))

The syntax is a little tricky: first we give the column names, then the next argument shows how we want to transform the columns. . is a placeholder for the current column, ~ means R won't immediately try to calculate mean(.) but will wait until we have actual values to substitute in. When we use a list and give a name to the transformation, like list(transform = ~ . + 2), dplyr automatically uses the name as a suffix, so you get column names like x_transform, y_transform etc.

Nick Cox
  • 35,529
  • 6
  • 31
  • 47
Marius
  • 58,213
  • 16
  • 107
  • 105
  • Thanks so much. Both methods work well. How about looping through the column named var1, var2, var3? As I have 27 columns named as this, I would like to know if there is a better way to write the code rather than write `c("var1", "var2", "var3", "var4",......"var27)`. – kmY May 29 '19 at 04:48
  • Yes: `paste0("var", 1:27)` would create the full vector of column names. – Marius May 29 '19 at 04:51
2

Recoding several variables at once

I would like to recode all "999" values in the variables to missing. In Stata, I can do

forvalue i = 1(1)3{
  replace var`i' ="NA" if var`i' =="999"
}

(For completeness) You could also do the recoding of several variables using lapply.

The lapply() function takes a set of variables and applies a function e.g. ifelse to it. You need to tell it the dataset and the variables using the [] subsetting, e.g. data[,variables].
Then you define what you want to do, this could be anything recoding etc. you, that uses a variable. The function starts by defining something similar to the "i" local in the Stata loop: function(var), here var would have a similar role to the i. Finally, you need to say where the result of lapply goes, i.e. to new or recoded variables, again using data[,variables].

Here an example:

# Example data
data <- data.frame(
  var1 = c( 1,2,999),
  var2 = c(1,999,2),
  var3 = c(1,3,999)
)

# Object with the names of the variables you like to recode.
vars_to_recode <- c("var1","var2","var3")

# Recoding
data[ ,vars_to_recode] <- lapply(data[ ,vars_to_recode],
                                      function(var)
                                        ifelse(var == 999, NA, var)
                                      )
data

#    var1 var2 var3
# 1    1    1    1
# 2    2   NA    3
# 3   NA    2   NA

What this does is actually closer to Stata's replace, in that the original variables are replaced with a transformed variable.

An alternative to lapply, is map() from the purrr-package, but particularly for programming I (currently) prefer the base R function.

New variables containing the mean of old variables

A second part of the question that can also be answered using lapply is how to get variables with containing the means of others. From the original question:

Also, if I have column named ht, wgt, bmi, I would like to calculate the mean of the column and store the mean in new column with respective name. In Stata, I can do

foreach i of varlist ht wgt bmi{
  gen `i'mean = mean(`i')
}

The solution, using lapply simple calculates the mean and puts it into a new variable/column. This works because R automatically fills up any column ("vector") to the length of the dataframe (called "recycling").

Example data

df <- data.frame(
      id  = 1:2,
      ht  = c(154.5,164.2),
      wgt = c(43.1 ,63),
      bmi = c(18.1 ,23.4))

Define variables you want to change and names for new variables.

vars <- names(df[,2:4])

# Names for new variables
newvars <- paste(names(df),"_mean")
newvars
# [1] "ht _mean"  "wgt _mean" "bmi _mean"

Generate new variables containing the means of the variables of interest:

df[,newvars] <- lapply(df[,vars], 
                       function(var)
                           mean(var)
                       )

Result:

df

#   ht  wgt  bmi ht _mean wgt _mean bmi _mean
# 1 154.5 43.1 18.1   159.35     53.05     20.75
# 2 164.2 63.0 23.4   159.35     53.05     20.75
Tfsnuff
  • 181
  • 6
1

So several options here. I strongly recommend adding some sample data so that we can better help you. Depending on what you are doing you could do several things:

library(dplyr)
mtcars %>% 
    mutate(my_hp = case_when(
        hp<50~"Small",
        hp < 100~"Medium",
        TRUE~"Large"
    ))

This applies to conditional logic use the mutate function to create a new variable(column) and the case_when function to use different cases dependent on the value of hp.

Additionally, you could go with a base R approach which could look like

mtcars$my_hp <- ifelse(mtcars$hp < 50, "Small",
                                             ifelse(mtcars$hp< 100, "Medium",
                                                         "Large"))

So in this case you are creating a new value called my_hp using chained ifelse statements which check for a condition.

And if you absolutely wanted to do a loop, which you would not need to do in this case, you could do something like:

for( i in 1:nrow(mtcars)){
    mtcars$my_hp[i] <- ifelse(mtcars$hp[i] < 50, "Small",
                                                            ifelse(mtcars$hp[i]< 100, "Medium",
                                                                         "Large"))

}
MDEWITT
  • 2,338
  • 2
  • 12
  • 23
  • Thanks for reply. But as Marius said I would like to loop over column rather than loop over row. I've edited my question and given some examples. Hope this makes my question more clearer. – kmY May 29 '19 at 02:24