3

How do I create a loop and/or a function to divide 200 columns (and create 200 new columns/variables) by another column to get a percentage?

How do I do this in a loop so I can do 200 columns? and how do I name the name the columns so that it is the old column name with a "p_" in front of it? Is this possible?

For example I'm trying to do something like this but with 200 columns.

fans <- data.frame(
  population = c(1234, 5678, 2345, 6789, 3456, 7890,
                4567, 8901, 5678, 9012, 6789),
  bearsfans = c(123, 234, 345, 456, 567,678, 789, 890, 901, 135, 246),
  packersfans = c(11,22,33,44,55,66,77,88,99,100,122),
  vikingsfans = c(39, 49, 59, 61, 32, 22, 31, 92, 52, 10, 122))

print(fans)

attach(fans)

## create new columns which are the ratio of fans to population

fans$p_bearsfan = bearsfans/population
print(fans)


 Output:
 ##     population bearsfans packersfans vikingsfans p_bearsfan
 ##  1        1234       123          11          39 0.09967585
 ##  2        5678       234          22          49 0.04121169
shp5009
  • 47
  • 3

4 Answers4

1
temp <- sapply(fans[-1], function(x) x / fans$population)
colnames(temp) <- paste0("p_", colnames(temp))
cbind(fans, temp)



population bearsfans packersfans vikingsfans p_bearsfans p_packersfans p_vikingsfans
1        1234       123          11          39  0.09967585   0.008914100   0.031604538
2        5678       234          22          49  0.04121169   0.003874604   0.008629799
3        2345       345          33          59  0.14712154   0.014072495   0.025159915
4        6789       456          44          61  0.06716748   0.006481072   0.008985123
5        3456       567          55          32  0.16406250   0.015914352   0.009259259
6        7890       678          66          22  0.08593156   0.008365019   0.002788340
7        4567       789          77          31  0.17276111   0.016860083   0.006787826
8        8901       890          88          92  0.09998877   0.009886530   0.010335917
9        5678       901          99          52  0.15868263   0.017435717   0.009158154
10       9012       135         100          10  0.01498003   0.011096316   0.001109632
11       6789       246         122         122  0.03623509   0.017970246   0.017970246
Lennyy
  • 5,932
  • 2
  • 10
  • 23
1

If you're happy with a suffixed new column name (instead of prefixing), this is a one-liner using dplyr::mutate_at. I assume here that all relevant columns end with the word "fans".

With suffixes

fans %>% mutate_at(vars(ends_with("fans")), list(percent = ~.x / population))
#   population bearsfans packersfans vikingsfans bearsfans_percent
#1        1234       123          11          39        0.09967585
#2        5678       234          22          49        0.04121169
#3        2345       345          33          59        0.14712154
#4        6789       456          44          61        0.06716748
#5        3456       567          55          32        0.16406250
#6        7890       678          66          22        0.08593156
#7        4567       789          77          31        0.17276111
#8        8901       890          88          92        0.09998877
#9        5678       901          99          52        0.15868263
#10       9012       135         100          10        0.01498003
#11       6789       246         122         122        0.03623509
#   packersfans_percent vikingsfans_percent
#1          0.008914100         0.031604538
#2          0.003874604         0.008629799
#3          0.014072495         0.025159915
#4          0.006481072         0.008985123
#5          0.015914352         0.009259259
#6          0.008365019         0.002788340
#7          0.016860083         0.006787826
#8          0.009886530         0.010335917
#9          0.017435717         0.009158154
#10         0.011096316         0.001109632
#11         0.017970246         0.017970246

With prefixes

To turn the suffixes into prefixes requires on more step

fans %>%
    mutate_at(vars(ends_with("fans")), list(percent = ~.x / population)) %>%
    rename_at(vars(ends_with("percent")), ~sub("(.+)_percent", "p_\\1", .x))
#   population bearsfans packersfans vikingsfans p_bearsfans p_packersfans
#1        1234       123          11          39  0.09967585   0.008914100
#2        5678       234          22          49  0.04121169   0.003874604
#3        2345       345          33          59  0.14712154   0.014072495
#4        6789       456          44          61  0.06716748   0.006481072
#5        3456       567          55          32  0.16406250   0.015914352
#6        7890       678          66          22  0.08593156   0.008365019
#7        4567       789          77          31  0.17276111   0.016860083
#8        8901       890          88          92  0.09998877   0.009886530
#9        5678       901          99          52  0.15868263   0.017435717
#10       9012       135         100          10  0.01498003   0.011096316
#11       6789       246         122         122  0.03623509   0.017970246
#   p_vikingsfans
#1    0.031604538
#2    0.008629799
#3    0.025159915
#4    0.008985123
#5    0.009259259
#6    0.002788340
#7    0.006787826
#8    0.010335917
#9    0.009158154
#10   0.001109632
#11   0.017970246
Maurits Evers
  • 49,617
  • 4
  • 47
  • 68
  • Thanks. This is very helpful. I maybe shouldn't have added fans to the variable names and was just thinking columns. My variables do have groups of names with the same suffixes but not all of them and the suffixes are different by group. Some of the columns are named completely different. Are there ways to do it just by column number and still add the 'p_' to the beginning or end of the variable name? – shp5009 Sep 20 '19 at 14:07
  • @shp5009 Sure; you can specify the column numbers directly. For example, `fans %>% mutate_at(2:3, list(percent = ~.x / population))` creates new percentage columns only based on columns `2:3`. – Maurits Evers Sep 20 '19 at 14:18
1

We can directly divide multiple columns with one column. We use grep to select columns which end with "fans" and use those names to assign new columns.

cols <- grep("fans$", names(fans), value = TRUE)
fans[paste0("p_", cols)] <- fans[cols]/fans$population

fans
#   population bearsfans packersfans vikingsfans p_bearsfans p_packersfans p_vikingsfans
#1        1234       123          11          39     0.09968      0.008914      0.031605
#2        5678       234          22          49     0.04121      0.003875      0.008630
#3        2345       345          33          59     0.14712      0.014072      0.025160
#4        6789       456          44          61     0.06717      0.006481      0.008985
#5        3456       567          55          32     0.16406      0.015914      0.009259
#6        7890       678          66          22     0.08593      0.008365      0.002788
#7        4567       789          77          31     0.17276      0.016860      0.006788
#8        8901       890          88          92     0.09999      0.009887      0.010336
#9        5678       901          99          52     0.15868      0.017436      0.009158
#10       9012       135         100          10     0.01498      0.011096      0.001110
#11       6789       246         122         122     0.03624      0.017970      0.017970

Also as a side note : Why is it not advisable to use attach() in R, and what should I use instead?

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
0

Data

  fans <- data.frame(
  population = c(1234, 5678, 2345, 6789, 3456, 7890,
                 4567, 8901, 5678, 9012, 6789),
  bearsfans = c(123, 234, 345, 456, 567,678, 789, 890, 901, 135, 246),
  packersfans = c(11,22,33,44,55,66,77,88,99,100,122),
  vikingsfans = c(39, 49, 59, 61, 32, 22, 31, 92, 52, 10, 122))

Naive Solution

for (c in names(fans)[-1]){
  fans[[paste0("p_",c)]] <- fans[[c]] / fans[["population"]]
}
Community
  • 1
  • 1
slava-kohut
  • 4,203
  • 1
  • 7
  • 24
  • 1
    Having two nested loops (`apply` within `for`) is not necessary and creates unnecessary overhead. In fact there shouldn't be a need for `apply` here as simple division in R `/` is already vectorised. – Maurits Evers Sep 19 '19 at 22:14
  • 1
    [continued] So the `for` loop would become `for (c in names(fans)[2:ncol(fans)]) fans[[paste0("p_",c)]] <- fans[[c]] / fans[["population"]]`. – Maurits Evers Sep 19 '19 at 23:10
  • @MauritsEvers yes I agree, that's very inefficient - I realize that. Edited my post. – slava-kohut Sep 19 '19 at 23:13
  • @MauritsEvers out of curiosity, I checked how slow is the solution with `apply`. It is still almost 10x faster than `mutate` (but 3x slower than naive loop over columns - the fastest of all posted in this thread). – slava-kohut Sep 19 '19 at 23:49
  • You should include the results from those benchmark tests in your answer. I'm sure this will be interesting to OP and future readers. I certainly would be interested. It doesn't surprise me that `mutate_at` is not the fastest method. `tidyverse` methods are usually more about code readability (and even that would be contested by some) than about speed. – Maurits Evers Sep 19 '19 at 23:55
  • @MauritsEvers Yes, will do when time permits. I also included one more solution using `Reduce` in benchmarks, but it is still slower than the naive loop. – slava-kohut Sep 20 '19 at 00:01
  • Great! You may already be aware of the R package `microbenchmark` but if not, it's a great library to quickly perform benchmark comparisons and show their results. Definitely worth an upvote from me if you were to include all this in your post:-) – Maurits Evers Sep 20 '19 at 00:16
  • Thanks, everybody. – shp5009 Sep 20 '19 at 14:37