R dplyr how to select variables by column number rather than column name with summarise

Question

I'm using dplyr and Hmisc to prepare a table of weighted statistics by group as per the R code below.

require(Hmisc)  # weighted statistcs

StTbl <- iris %>%
  group_by(Species) %>%                                                    # Group species
  summarise(n = n(),                                                       # number of records                  
            WtMn = wtd.mean(Sepal.Length, Petal.Width),                    # weighted mean
            WtSd = sqrt(wtd.var(Sepal.Length, Petal.Width)),               # weighted SD
            WtCV = WtMn/WtSd,                                              # weighted CV
            Minm = min(Sepal.Length),                                      # minumum
            Wp05 = wtd.quantile(Sepal.Length, Petal.Width , 0.05),         # p05
            Wp50 = wtd.quantile(Sepal.Length, Petal.Width , 0.50),         # p50
            Wp95 = wtd.quantile(Sepal.Length, Petal.Width , 0.95),         # p95 
            Wp975 = wtd.quantile(Sepal.Length, Petal.Width , 0.975),       # p975
            Wp99 = wtd.quantile(Sepal.Length, Petal.Width , 0.99),         # p99
            Maxm = max(Sepal.Length)                                       # maximum
  )

StTbl

A tibble: 3 x 12
  Species        n  WtMn  WtSd  WtCV  Minm  Wp05  Wp50  Wp95 Wp975  Wp99  Maxm
  <fct>      <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa        50  5.05 0.356  14.2   4.3  4.61  5.06  5.62  5.70  5.72   5.8
2 versicolor    50  5.98 0.508  11.8   4.9  5.13  6     6.80  6.97  7      7  
3 virginica     50  6.61 0.626  10.6   4.9  5.8   6.5   7.7   7.7   7.9    7.9

Now rather than use the column names of the table I wish to use a column index so I can loop through a number of columns preparing the statistics tables for or each column. I've found there are a number of suggestions on how to do this on StackOverflow including:

double square or single brackets using the table name and index number, for example substituting ".[1]" or "iris1" instead of "Sepal.Length" in the code above - these suggestions run without errors but return NA results
Use the get function such as "get(iris1)" - this suggestion returns a invalid first argument error
The suggestion that dplyr does not really support column index and that column index is a bad idea and I should tackle the problem another way - I'm not sure what another 'tidyverse' way would this be?
Write a custom function - here I'm not sure where to start with this for my example

score 2 · Accepted Answer · answered Jul 24 '21 at 11:19

Making use of the .data pronoun from rlang you could write a custom function which takes a dataframe, the names of two variables and some additional grouping variables and computes your desired summary table like so:

library(dplyr)
library(Hmisc)

summary_table <- function(.data, x, y, ...) {
  .data %>%
    group_by(...) %>%                                                    # Group species
    summarise(n = n(),                                                       # number of records                  
              WtMn = wtd.mean(.data[[x]], .data[[y]]),                    # weighted mean
              WtSd = sqrt(wtd.var(.data[[x]], .data[[y]])),               # weighted SD
              WtCV = WtMn/WtSd,                                              # weighted CV
              Minm = min(.data[[x]]),                                      # minumum
              Wp05 = wtd.quantile(.data[[x]], .data[[y]] , 0.05),         # p05
              Wp50 = wtd.quantile(.data[[x]], .data[[y]] , 0.50),         # p50
              Wp95 = wtd.quantile(.data[[x]], .data[[y]] , 0.95),         # p95 
              Wp975 = wtd.quantile(.data[[x]], .data[[y]] , 0.975),       # p975
              Wp99 = wtd.quantile(.data[[x]], .data[[y]] , 0.99),         # p99
              Maxm = max(.data[[x]])                                       # maximum
    )  
}

summary_table(iris, "Sepal.Length", "Petal.Width", Species)
#> # A tibble: 3 x 12
#>   Species        n  WtMn  WtSd  WtCV  Minm  Wp05  Wp50  Wp95 Wp975  Wp99  Maxm
#>   <fct>      <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa        50  5.05 0.356  14.2   4.3  4.61  5.06  5.62  5.70  5.72   5.8
#> 2 versicolor    50  5.98 0.508  11.8   4.9  5.13  6     6.80  6.97  7      7  
#> 3 virginica     50  6.61 0.626  10.6   4.9  5.8   6.5   7.7   7.7   7.9    7.9

summary_table(iris, "Sepal.Width", "Petal.Width", Species)
#> # A tibble: 3 x 12
#>   Species        n  WtMn  WtSd  WtCV  Minm  Wp05  Wp50  Wp95 Wp975  Wp99  Maxm
#>   <fct>      <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa        50  3.47 0.399  8.69   2.3  3.06  3.46  4.27  4.4    4.4   4.4
#> 2 versicolor    50  2.80 0.310  9.04   2    2.3   2.86  3.20  3.37   3.4   3.4
#> 3 virginica     50  3.00 0.320  9.38   2.2  2.5   3     3.6   3.8    3.8   3.8

score 1 · Answer 2 · answered Jul 24 '21 at 11:36

To use column numbers instead of column names in dplyr you can subset the data from cur_data().

library(dplyr)

iris %>%
  group_by(Species) %>%                                                    
  summarise(n = n(),                                                       
            WtMn = wtd.mean(cur_data()[[1]], cur_data()[[4]]),             
            WtSd = sqrt(wtd.var(cur_data()[[1]], cur_data()[[4]])),        
            WtCV = WtMn/WtSd,                                              
            Minm = min(cur_data()[[1]]),                                   
            Wp05 = wtd.quantile(cur_data()[[1]], cur_data()[[4]] , 0.05),  
            Wp50 = wtd.quantile(cur_data()[[1]], cur_data()[[4]] , 0.50),  
            Wp95 = wtd.quantile(cur_data()[[1]], cur_data()[[4]] , 0.95),  
            Wp975 = wtd.quantile(cur_data()[[1]], cur_data()[[4]] , 0.975),
            Wp99 = wtd.quantile(cur_data()[[1]], cur_data()[[4]] , 0.99),  
            Maxm = max(cur_data()[[1]])                                    
  )

#  Species        n  WtMn  WtSd  WtCV  Minm  Wp05  Wp50  Wp95 Wp975  Wp99  Maxm
#  <fct>      <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 setosa        50  5.05 0.356  14.2   4.3  4.61  5.06  5.62  5.70  5.72   5.8
#2 versicolor    50  5.98 0.508  11.8   4.9  5.13  6     6.80  6.97  7      7  
#3 virginica     50  6.61 0.626  10.6   4.9  5.8   6.5   7.7   7.7   7.9    7.9

As of dplyr 1.1.0, we can use `pick(1)` instead of `cur_data()[[1]]`. — Liang Zhang, May 23 '23 at 06:38
@LiangZhang When I do that I get an error ` In argument: `Wp05 = wtd.quantile(pick(1), pick(4), 0.05)`. ℹ In group 1: `Species = setosa`. Caused by error in `order()`: ! unimplemented type 'list' in 'orderVector1'` — Ronak Shah, May 23 '23 at 09:40
I found that `pick()` always return `tibble()`. So `wp05 = Hmisc::wtd.quantile(unlist(pick(1)), unlist(pick(4)), 0.05)` will work, but not so gracefully. Hopefully this could be taken care of in {dplyr}, at least optionally. — Liang Zhang, May 23 '23 at 11:58
I filed [an issue](https://github.com/tidyverse/dplyr/issues/6855) there. Maybe there could be better option. — Liang Zhang, May 23 '23 at 12:11

score 1 · Answer 3 · answered Aug 25 '21 at 23:23

Combining the great solutions from Ronak Shah and stefan, I thought a custom function can potentially spare one from the repetitive typing...

summaryfun <- function (x,y){
  c(
  length(x),
  wtd.mean(x,y),
  sqrt(wtd.var(x, y)),
  (wtd.mean(x,y)/  sqrt(wtd.var(x, y))),
  min(x),
  map_dbl(c(0.05, 0.50, 0.95, 0.975, 0.99), ~wtd.quantile(x,y,.x)),
  max(x))  %>%
    set_names(
      c('n', 'WtMn', 'WtSd', 'WtCV', 'Minm', 'Wp05', 'Wp50', 'Wp95', 'Wp975', 'Wp99', 'Maxm')) %>% 
  return
  }

iris %>% 
  group_by(Species) %>% 
  #summarise(index_by_name = list(summaryfun(x=Sepal.Length, y=Petal.Width))) %>% 
  summarise(index_by_position = list(summaryfun(x=cur_data()[[1]], y=cur_data()[[4]]))) %>%
  unnest_wider(index_by_position)

# A tibble: 3 x 12
  Species     n  WtMn  WtSd  WtCV  Minm  Wp05  Wp50  Wp95 Wp975  Wp99  Maxm
  <fct>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa     50  5.05 0.356  14.2   4.3  4.61  5.06  5.62  5.70  5.72   5.8
2 versic~    50  5.98 0.508  11.8   4.9  5.13  6     6.80  6.97  7      7  
3 virgin~    50  6.61 0.626  10.6   4.9  5.8   6.5   7.7   7.7   7.9    7.9

Clever! but difficult to follow - prefer longer but clearer – Markm0705 Apr 15 '22 at 04:49 — Markm0705, Apr 15 '22 at 04:49

R dplyr how to select variables by column number rather than column name with summarise

3 Answers3