
I have been trying for some time to understand the tidyverse's design and how to program with it. While writing a function that uses tidyselect semantics, I found that tidyselect::eval_select appends numbers to the names given on the lhs of expressions. This is not surprising, since these semantics are used for column renaming. Unfortunately, my function, which builds a data structure, doesn't need this behavior; it needs the plain name provided on the lhs of the expression (repeated as many times as necessary). I haven't managed to find out where this behavior comes from; it looks like make.unique, but I can't find where it is implemented. If you know, I am curious to learn, but solving my problem shouldn't depend on it. All I want is for the lhs names not to have numbers appended, as in the example:

library(tidyverse)

# Data
data <- mtcars[, 8:11]

# Example
data %>%
  tidyselect::eval_select(rlang::expr(c(foo = 1, bar = c(2:4), foobar = c(1, "am", "gear", "carb"))), .)
#>     foo    bar1    bar2    bar3 foobar1 foobar2 foobar3 foobar4 
#>       1       2       3       4       1       2       3       4

# Function
test <- function(.data, ...) {
  loc <- tidyselect::eval_select(rlang::expr(c(...)), .data)
  names <- names(.data)
  list(names(loc), names[loc])
}

data %>%
  test(foo = 1, bar = c(2:4), foobar = c(1, "am", "gear", "carb"))
#> [[1]]
#> [1] "foo"     "bar1"    "bar2"    "bar3"    "foobar1" "foobar2" "foobar3"
#> [8] "foobar4"
#> 
#> [[2]]
#> [1] "vs"   "am"   "gear" "carb" "vs"   "am"   "gear" "carb"

Created on 2021-05-22 by the reprex package (v2.0.0)

Desired output:

#> [[1]]
#> [1] "foo"    "bar"    "bar"    "bar"    "foobar" "foobar" "foobar" "foobar"
#> 
#> [[2]]
#> [1] "vs"   "am"   "gear" "carb" "vs"   "am"   "gear" "carb"

Any help is greatly appreciated.

Claudiu Papasteri

2 Answers


The problem is caused by a function called ensure_named, nested deep inside eval_select's implementation. It is part of the vars_select_eval function.

ensure_named(pos, vars, uniquely_named, allow_rename)

The good news is that we only need to override the uniquely_named argument. This argument is passed down from the first implementation function, eval_select_impl, which is called by eval_select itself. So all we need to do is write our own version of tidyselect::eval_select.

To get the wanted output we need to do two things:

  1. Add uniquely_named = NULL as an argument and set it to FALSE when calling the function.
  2. Specify the existing argument name_spec = "{outer}". This step alone will not suffice; uniquely_named must also be set to FALSE.

Before the actual code, a note of caution:

tidyselect::eval_select deliberately does not allow duplicate column names.

For starters, it is not possible to easily create a tibble with duplicate column names:

tibble(a = 1:3, b = 4:6, a = 7:9)
#> Error: Column name `a` must not be duplicated.
#> Use .name_repair to specify repair.

One workaround is to use a list with tibble::new_tibble:

tibble::new_tibble(list(a = 1:3, b = 4:6, a = 7:9), nrow = 3)
#> # A tibble: 3 x 3
#>       a     b     a
#>   <int> <int> <int>
#> 1     1     4     7
#> 2     2     5     8
#> 3     3     6     9
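The error message above points at the .name_repair argument, so a minimal sketch of that route may be useful as well. It assumes only the documented tibble() repair strategies "minimal" and "unique"; the object names dup and uniq are illustrative.

```r
library(tibble)

# "minimal" performs no repair, so the duplicated name `a` is kept as-is
dup <- tibble(a = 1:3, b = 4:6, a = 7:9, .name_repair = "minimal")
names(dup)

# "unique" makes the names unique by appending positional suffixes
uniq <- tibble(a = 1:3, b = 4:6, a = 7:9, .name_repair = "unique")
names(uniq)
```

With "minimal" the tibble keeps two columns called a, which runs into the same problems with {dplyr} verbs as described below for a data.frame.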

For a data.frame, it is only possible to create non-unique names when the check.names argument is set to FALSE:

data.frame(a = 1:3, b = 4:6, a = 7:9, check.names = FALSE)
#>   a b a
#> 1 1 4 7
#> 2 2 5 8
#> 3 3 6 9

But when we use this data.frame with regular {dplyr} verbs, an error will be thrown, telling us that we cannot transform data frames with duplicate names:

data.frame(a = 1:3, b = 4:6, a = 7:9, check.names = FALSE) %>% 
  mutate(c = 1:3)
#> Error: Can't transform a data frame with duplicate names.

So from this we can assume that it is not recommended to use data.frames with duplicate names in the {tidyverse}. It probably contradicts the notion of tidy data.

That being said, below is the approach described above for solving this problem:

library(tidyverse)

# Data
data <- mtcars[, 8:11]

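# custom eval_select function; note that it calls the unexported
# tidyselect:::eval_select_impl, which may change between package versions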
my_eval_select <- function(expr, data,
                           env = rlang::caller_env(),
                           ..., include = NULL, 
                           exclude = NULL, strict = TRUE,
                           name_spec = NULL,
                           uniquely_named = NULL, # this is the new argument
                           allow_rename = TRUE) {
  ellipsis::check_dots_empty()
  tidyselect:::eval_select_impl(data, names(data), rlang::as_quosure(expr, env), 
                   include = include, exclude = exclude, strict = strict, 
                   name_spec = name_spec, allow_rename = allow_rename,
                   uniquely_named = uniquely_named) # which we also add here
}

# example 1
data %>%
  my_eval_select(rlang::expr(c(foo = 1, bar = c(2:4), foobar = c(1, "am", "gear", "carb"))),
                          data = .,
                          name_spec = "{outer}",  # we need to specify this
                          uniquely_named = FALSE) # and this
#>    foo    bar    bar    bar foobar foobar foobar foobar 
#>      1      2      3      4      1      2      3      4

# example: custom function
test <- function(.data, ...) {
  loc <- my_eval_select(rlang::expr(c(...)),
                        data = .data,
                        name_spec = "{outer}",
                        uniquely_named = FALSE)
  names <- names(.data)
  list(names(loc), names[loc])
}

# test
data %>%
  test(foo = 1, bar = c(2:4), foobar = c(1, "am", "gear", "carb"))
#> [[1]]
#> [1] "foo"    "bar"    "bar"    "bar"    "foobar" "foobar" "foobar" "foobar"
#> 
#> [[2]]
#> [1] "vs"   "am"   "gear" "carb" "vs"   "am"   "gear" "carb"

Created on 2021-05-22 by the reprex package (v0.3.0)

TimTeaFan
    Thank you for the detailed answer, I learned a lot. I agree with the principle that having duplicates in data frames or tibbles is a very bad idea. Do you think that a function like the one you provided as an answer would also be inadvisable in the case of building data structures (e.g. a list of vectors that each have unique values, but where values are repeated between vectors, the values being column indices)? – Claudiu Papasteri May 23 '21 at 11:20
    It depends. If this function and the "untidy" data structures it produces are part of an internal function then that's no problem. If other users interact with the function and the data structure, then it might be confusing. Once you use {tidyeval} syntax people might expect tidy output. If the users are only you and your team / colleagues then that's easily communicated. – TimTeaFan May 23 '21 at 20:51
    Yes, the "untidy" data structure would be part of internal functions but the main function would provide only tidy output. Thank you again for the useful replies and directions. – Claudiu Papasteri May 24 '21 at 08:19

Thank you again @TimTeaFan for the thorough answer. I will keep it as the "right" answer because I find it so useful. I was late to come across the variable renaming rules of the tidyverse. Outer names are propagated to the selected elements according to the following rules: (1) With data frames, a numeric suffix is appended because columns must be uniquely named. (2) With normal vectors, the name is simply assigned to all selected inputs.

So I am posting this as an answer to my own question because it is much easier and achieves the same result for my function, which creates a simple data structure. I am not sure if there are any downsides to this, but I can't see any from testing.

library(tidyverse)

# Data
data <- mtcars[, 8:11]
  
# custom function
test <- function(.data, ...) {
  data <- as.list(.data)
  loc <- tidyselect::eval_rename(rlang::expr(c(...)), data)
  names <- names(.data)
  list(names(loc), names[loc])
}

# test
data %>%
  test(foo = 1, bar = c(2:4), foobar = c(1, "am", "gear", "carb"))
#> [[1]]
#> [1] "foo"    "bar"    "bar"    "bar"    "foobar" "foobar" "foobar" "foobar"
#> 
#> [[2]]
#> [1] "vs"   "am"   "gear" "carb" "vs"   "am"   "gear" "carb"

Created on 2021-06-03 by the reprex package (v2.0.0)
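The two renaming rules quoted above can also be checked side by side. This is a minimal sketch: it passes the same selection once with the data frame and once with as.list() of it, and it assumes eval_select follows the same naming rules as eval_rename for list input. The names sel, loc_df and loc_list are illustrative.

```r
library(tidyselect)

data <- mtcars[, 8:11]
sel <- rlang::expr(c(bar = 2:4))

# Data frame input: rule (1) applies, so a numeric suffix is expected
loc_df <- eval_select(sel, data)

# List input: rule (2) applies, so the outer name should simply be repeated
loc_list <- eval_select(sel, as.list(data))

names(loc_df)
names(loc_list)
```

Both calls select the same locations; only the names differ, which is why converting with as.list() is enough when the result is used for building a data structure and not for renaming columns.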

Claudiu Papasteri