Use summarize and a for loop taking column names from a character vector

Question

I have a dataset that I cannot share here. I need to create summary columns in a for loop, with the column names taken from a character vector. Below I replicate what I am trying to achieve using the flights dataset from the nycflights13 package.

install.packages("nycflights13")
library(nycflights13)

flights <- nycflights13::flights
flights <- flights[c(10, 16, 17)]

var_interest <- c("distance", "hour")

for (i in 1:length(var_interest)) {
  flights %>% group_by(carrier) %>%
    summarize(paste(var_interest[i], "n", sep = "_") = sum(paste(var_interest[i])))
}

This code generates the following error:

Error: unexpected '=' in:
"  flights %>% group_by(carrier) %>%
    summarize(paste(var_interest[i], "n", sep = "_") ="
> }
Error: unexpected '}' in "}"

My actual dataset is more complex than this example and therefore, I need to follow this approach. So if you could help me find what I am missing here, that would be highly appreciated!

akrun · Accepted Answer · 2021-02-17T23:39:05.377

2

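The code can be modified in two places. On the lhs, use the := operator instead of = and unquote (!!) the pasted string so that it becomes the column name. On the rhs, convert the string to a symbol with rlang::sym and unquote (!!) it so that sum operates on the column rather than on the string.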

out <- vector('list', length(var_interest))
for (i in seq_along(var_interest)) {
  out[[i]] <- flights %>%
    group_by(carrier) %>%
    summarize(!! paste(var_interest[i], "n", sep = "_") :=
                sum(!! rlang::sym(var_interest[i])), .groups = 'drop')
}


lapply(out, head, 3)
#[[1]]
# A tibble: 3 x 2
#  carrier distance_n
#  <chr>        <dbl>
#1 9E         9788152
#2 AA        43864584
#3 AS         1715028

#[[2]]
# A tibble: 3 x 2
#  carrier hour_n
#  <chr>    <dbl>
#1 9E      266419
#2 AA      413361
#3 AS        9013

There are multiple ways to pass a column name as a string and evaluate it:

1. As shown above, convert the string to a symbol and unquote it (!!).
2. Use across, which accepts unquoted names, strings (here via all_of), or integer column positions. In that case, no loop is needed at all:

flights %>%
  group_by(carrier) %>%
  summarise(across(all_of(var_interest), ~ sum(., na.rm = TRUE),
                   .names = '{.col}_n'),
            .groups = 'drop')
# A tibble: 16 x 3
#   carrier distance_n hour_n
#   <chr>        <dbl>  <dbl>
# 1 9E         9788152 266419
# 2 AA        43864584 413361
# 3 AS         1715028   9013
# 4 B6        58384137 747278
# 5 DL        59507317 636932
# 6 EV        30498951 718187
# 7 F9         1109700   9441
# 8 FL         2167344  43960
# 9 HA         1704186   3324
#10 MQ        15033955 358779
#11 OO           16026    550
#12 UA        89705524 754410
#13 US        11365778 252595
#14 VX        12902327  63876
#15 WN        12229203 151366
#16 YV          225395   9300
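
A third route, not covered above, is the .data pronoun, which looks a column up by its string name without converting it to a symbol first. The following is only a minimal sketch under stated assumptions: `toy` is a made-up stand-in for flights with invented values, and the loop otherwise mirrors the one at the top of this answer.

```r
library(dplyr)

# Hypothetical toy data standing in for nycflights13::flights (values invented)
toy <- tibble(
  carrier  = c("AA", "AA", "UA", "UA"),
  distance = c(100, 200, 300, 400),
  hour     = c(6, 7, 8, 9)
)
var_interest <- c("distance", "hour")

out <- vector("list", length(var_interest))
for (i in seq_along(var_interest)) {
  v <- var_interest[i]
  out[[i]] <- toy %>%
    group_by(carrier) %>%
    # lhs: unquote the string to name the new column
    # rhs: .data[[v]] fetches the column whose name is stored in v
    summarize(!!paste0(v, "_n") := sum(.data[[v]]), .groups = "drop")
}

out[[1]]  # one row per carrier, with columns carrier and distance_n
```

Whether this reads better than sym + !! is mostly a matter of taste; both keep the column name as an ordinary string in the loop.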

edited Feb 17 '21 at 23:39

answered Feb 17 '21 at 22:14

akrun



Thank you! I need to go through your code carefully to understand it fully. I need to familiarize myself with !!, sym and :=. – Anup Feb 17 '21 at 22:57
Please excuse my lack of R knowledge... Why did you use sym in the second line of summarize but not in the first line? And what is .groups = 'drop' doing? – Anup Feb 18 '21 at 06:31

@TRa In the first line, i.e. on the lhs of `:=`, we are only assigning the column name, so a string is enough. On the rhs we need the values of that column, which is why I used `sym` + `!!`. Regarding `.groups`, I hope [this](https://stackoverflow.com/questions/62140483/how-to-interpret-dplyr-message-summarise-regrouping-output-by-x-override/62140681#62140681) helps you understand more – akrun Feb 18 '21 at 16:55
thank you. I tried to use your code in my dataset but I got the following error: ``Error: Only strings can be converted to symbols. Run `rlang::last_error()` to see where the error occurred.`` – Anup Feb 18 '21 at 22:36
@TRa could be an issue with `packageVersion('dplyr')` I used `1.0.2` – akrun Feb 18 '21 at 22:37
thanks for the prompt response. Mine is `1.0.4`. – Anup Feb 18 '21 at 22:39

@TRa I updated my package `packageVersion('dplyr')# [1] ‘1.0.4’`. It is running fine with the same code. Maybe it is the `rlang` version. I have `packageVersion('rlang')# [1] ‘0.4.10’` – akrun Feb 18 '21 at 22:46
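
The comments do not settle what caused this error. One possibility, offered only as a guess, is that the vector of column names was not a plain character vector (for example a factor), since `rlang::sym` accepts only single strings. The sketch below uses hypothetical inputs `good` and `bad` to show the difference; the exact wording of the error message varies across rlang versions, so the code only checks that an error is raised.

```r
library(rlang)

# Hypothetical inputs: a character vector versus a factor with the same labels
good <- c("distance", "hour")
bad  <- factor(c("distance", "hour"))

sym(good[1])  # works: returns the symbol `distance`

# A factor element is not a string, so sym() signals an error
res <- tryCatch(sym(bad[1]), error = function(e) e)
inherits(res, "error")

# Coercing to character first avoids the problem
fixed <- sym(as.character(bad[1]))
```

If this is the cause, checking `class(var_interest)` in the real dataset would be a quick way to confirm it.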

score 1 · Answer 2 · answered Feb 17 '21 at 23:08

A tidy way to do this might be to stack it longer rather than wider:

install.packages("nycflights13")
library(nycflights13)
library(dplyr)  # provides %>% and select(), used below

flights <- nycflights13::flights %>%
  select(carrier, distance, hour)

by_carrier <- purrr::map_dfr( c('distance','hour'), function(x) {
  flights %>% 
    dplyr::group_by(carrier) %>%
    dplyr::summarize(n = sum(!!as.name(x))) %>%
    dplyr::mutate(key = x)
})

If you still want the for loop to append columns, you can use !!as.name() twice, once on each side of :=, with something like

by_carrier <- NULL
for (i in c('distance', 'hour')) {
  df <-
    flights %>%
    dplyr::group_by(carrier) %>%
    dplyr::summarize(!!as.name(i) := sum(!!as.name(i)))
  by_carrier <- dplyr::bind_cols(by_carrier, df)
}

although you'd have to clean up the duplicated carrier columns after that one, since every iteration binds its own copy of carrier.
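
One way to avoid that cleanup, sketched here as an assumption rather than as part of the original answer, is to join each per-variable summary on carrier instead of binding columns side by side. `toy` is a hypothetical stand-in for flights with invented values.

```r
library(dplyr)

# Hypothetical toy data standing in for flights (values invented)
toy <- tibble(
  carrier  = c("AA", "AA", "UA", "UA"),
  distance = c(100, 200, 300, 400),
  hour     = c(6, 7, 8, 9)
)

by_carrier <- NULL
for (v in c("distance", "hour")) {
  df <- toy %>%
    group_by(carrier) %>%
    summarize(!!as.name(v) := sum(!!as.name(v)), .groups = "drop")
  # First iteration: keep df as is; afterwards join on the shared key
  by_carrier <- if (is.null(by_carrier)) df else left_join(by_carrier, df, by = "carrier")
}

by_carrier  # a single carrier column followed by distance and hour
```

Joining by key also protects against the summaries coming back in different row orders, which bind_cols would not detect.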
