
I have a dataset with a population variable, as well as a few races ("white", "black", "hispanic"), and I want to be able to loop through the races so that for each race, a "percent_race" variable is created ("percent_white", etc.), and the race variable is then dropped.

I am most familiar with Stata, where you can refer to the current loop value inside the loop by wrapping its name as `name' (a backtick, then an apostrophe). This lets me name the new variables using a string from my loop, and the same string also indicates which variables should be used in the formula for calculating those new variables. Here is what I mean:

loc races white black hispanic

foreach race of local races {
   generate `race'_percentage = (population/`race')*100
   drop `race'
   }

In R, I want something to the same effect:

races <- list("white", "black", "hispanic")

df %>%
   for (race in races) {
      mutate(percent_"race" = (population/race)*100) %>%
      select(df, -c(race)) %>%
      }

I threw the quotes around race when naming the variable as a filler; I know that doesn't work, but you see how I want the variables to be named.

There might be other things wrong with how I am approaching this in R. I've always done data transformation and analysis in Stata and moved to R for visualization, but I'm trying to learn to do it all in R. I'm not even sure whether using a for loop within a pipe is proper here, but it makes sense to me within this little problem I have created for myself.

bricevk
  • Can you post the format of your data with `dput(head(df))`? I think what you are asking should be quite straightforward but it's not clear what your data looks like - i.e. what is being divided by what. See [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for more. – SamR Jul 20 '22 at 12:09

2 Answers

1

Your Stata code implies a certain structure of df, namely, that there are separate columns for white, black, and hispanic. In that case, the structure should look something like the sample data I have constructed below, which suggests that you can use `mutate(across())` to transform the three variables.

library(dplyr)

races <- c("white", "black", "hispanic")
df %>%
  # divide each race column by population and name the result percent_<race>
  mutate(across(all_of(races), ~ .x * 100 / population, .names = "percent_{.col}")) %>%
  select(-all_of(races))

Output:

   population percent_white percent_black percent_hispanic
1       71662     96.303480     0.5288716         3.167648
2       77869     90.231029     4.0503923         5.718579
3       22985     69.071133    12.7996519        18.129215
4       49924     79.546911     7.5454691        12.907620
5       88292      2.462284    14.8699769        82.667739
6       82554     47.779635     7.2485888        44.971776
7       65403     75.846674     5.6297112        18.523615
8       85160     21.641616    36.5124472        41.845937
9       66434     31.819550    18.1352922        50.045158
10      29641     23.163861    65.9154549        10.920684

Input:

set.seed(123)
df = data.frame(population=sample(20000:100000, size = 10)) %>% 
  mutate(
    white = ceiling(population*runif(10)),
    black = ceiling((population-white)*runif(10)),
    hispanic = population-white-black
)

   population white black hispanic
1       71662 69013   379     2270
2       77869 70262  3154     4453
3       22985 15876  2942     4167
4       49924 39713  3767     6444
5       88292  2174 13129    72989
6       82554 39444  5984    37126
7       65403 49606  3682    12115
8       85160 18430 31094    35636
9       66434 21139 12048    33247
10      29641  6866 19538     3237
langtang
  • That is exactly what I needed and makes a lot of sense for the most part. Where can I read up on the exact syntax you used there? I'm a little confused by the "~.x" and the "{.col}". I've practiced regexes with grepl and gsub a bit, but I don't exactly get how you knew what to put there. – bricevk Jul 20 '22 at 12:48
  • The second argument in `across()` is the function to apply to each of the columns selected by the first argument. Using `~` is the tidy (`purrr`) shorthand for lambda functions; `.x` is a stand-in for the column (similar to the `x` in Python's `lambda x: x.upper()`, for example). The `.names` argument takes a `glue`-style specification for naming the new columns: here it combines the literal string `"percent_"` with the special `{.col}`, which refers to the name of the current column. – langtang Jul 20 '22 at 12:55
  • That makes a lot of sense. Thank you so much! – bricevk Jul 20 '22 at 12:59
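
To make the `~ .x` explanation above concrete, here is a minimal sketch showing that the formula shorthand and an ordinary anonymous function give the same result inside `across()`. The `toy` data frame and its values are made up for illustration and are not the asker's data.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  population = c(100, 200),
  white = c(50, 150),
  black = c(50, 50)
)
races <- c("white", "black")

# Formula shorthand: .x stands for the current column
a <- toy %>%
  mutate(across(all_of(races), ~ .x * 100 / population, .names = "percent_{.col}"))

# The same computation written as a regular anonymous function
b <- toy %>%
  mutate(across(all_of(races), function(x) x * 100 / population, .names = "percent_{.col}"))

# Both versions should produce the same columns and values
all.equal(a, b)
```

In both calls, `{.col}` is replaced by `"white"` and then `"black"`, so the new columns are named `percent_white` and `percent_black`.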
0

It's atypical, if not outright disallowed, to pipe a data frame into a for loop like that. A more typical and tidy way would be to reshape the data, compute the percentage once, and reshape back:

df <- data.frame(
  id = c('1', '2', '3'),
  population = c(100, 200, 300),
  white = c(50, 75, 100),
  black = c(25, 50, 150),
  hispanic = c(25, 75, 50)
)

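library(dplyr)  # provides %>%

df %>%
  tidyr::pivot_longer(!c(id, population)) %>%
  dplyr::mutate(percent = value / population * 100) %>%
  # widen using the computed percent, not the raw counts in `value`
  tidyr::pivot_wider(id_cols = c(id, population), names_from = name,
                     values_from = percent, names_prefix = "percent_")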

This code takes the wide data, reshapes it to long (so each 'id/race' combination is unique), calculates the percent, and then goes back to a wide format with the names percent_'race'.
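
If a loop closer to the Stata version is preferred, one possible sketch is a plain for loop with no pipe, building each new column name with `paste0()`. It assumes, as in the question, that each race has its own column, and it reuses the small example values from this answer.

```r
# Example data with one column per race (values are illustrative)
df <- data.frame(
  id = c("1", "2", "3"),
  population = c(100, 200, 300),
  white = c(50, 75, 100),
  black = c(25, 50, 150),
  hispanic = c(25, 75, 50)
)

races <- c("white", "black", "hispanic")

for (race in races) {
  # [[ ]] accepts a column name held in a string, much like `race' in Stata
  df[[paste0("percent_", race)]] <- df[[race]] / df$population * 100
  # drop the original count column
  df[[race]] <- NULL
}

names(df)
```

The loop has to modify `df` by assignment on each iteration, which is why it does not fit inside a pipe.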

dfletchy