I'm trying to develop a deeper understanding of using the dot (".") with dplyr and using the .data pronoun with dplyr. The code that motivated this post looked something like this:

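library(dplyr)  # also re-exports tibble() and %>%

# Empty table that accumulates one block of counts per variable
cat_table <- tibble(
  variable = vector("character"), 
  category = vector("numeric"), 
  n        = vector("numeric")
) 

for(i in c("cyl", "vs", "am")) {
  cat_stats <- mtcars %>% 
    count(.data[[i]]) %>%               # count by the column named in i
    mutate(variable = names(.)[1]) %>%  # record which column was counted
    rename(category = 1)
  
  cat_table <- bind_rows(cat_table, cat_stats)
}

cat_table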
# A tibble: 7 x 3
  variable category     n
  <chr>       <dbl> <dbl>
1 cyl             4    11
2 cyl             6     7
3 cyl             8    14
4 vs              0    18
5 vs              1    14
6 am              0    19
7 am              1    13

The code does what I wanted it to do and isn’t really the focus of this question. I was just providing it for context.

I'm trying to develop a deeper understanding of why it does what I want it to do. And more specifically, why I can't use . and .data interchangeably. I've read the Programming with dplyr article, but I guess in my mind, both . and .data just mean "our result up to this point in the pipeline." But, it appears as though I'm oversimplifying my mental model of how they work because I get an error when I use .data inside of names() below:

mtcars %>% 
  count(.data[["cyl"]]) %>% 
  mutate(variable = names(.data)[1])
Error: Problem with `mutate()` input `variable`.
x Can't take the `names()` of the `.data` pronoun
ℹ Input `variable` is `names(.data)[1]`.
Run `rlang::last_error()` to see where the error occurred.

And I get an unexpected (to me) result when I use . inside of count():

mtcars %>% 
  count(.[["cyl"]]) %>% 
  mutate(variable = names(.)[1])
  .[["cyl"]]  n   variable
1          4 11 .[["cyl"]]
2          6  7 .[["cyl"]]
3          8 14 .[["cyl"]]

I suspect it has something to do with, "Note that .data is not a data frame; it’s a special construct, a pronoun, that allows you to access the current variables either directly, with .data$x or indirectly with .data[[var]]. Don’t expect other functions to work with it," from the Programming with dplyr article. This tells me what .data isn't (a data frame), but I'm still not sure what .data is and how it differs from the dot (.).

I tried figuring it out like this:

mtcars %>% 
  count(.data[["cyl"]]) %>% 
  mutate(variable = list(.data))

But, the result <S3: rlang_data_pronoun> doesn't mean anything to me that helps me understand. If anybody out there has a better grasp on this, I would appreciate a brief lesson. Thanks!

Brad Cannell

4 Answers


Up front, I think .data's intent is a little confusing until one also considers its sibling pronoun, .env.

The dot . is something that magrittr's %>% sets up and uses; since dplyr re-exports %>%, the dot is available there too. Whenever you reference it, it is a real object, so names(.), nrow(.), etc. all work as expected. It does reflect the data up to this point in the pipeline.
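
As a minimal sketch of that point (the variable names first_name and n_rows are just for illustration), the dot can be handed to ordinary functions. The braces only stop the pipe from also passing the data as a first argument:

```r
library(magrittr)

# Inside the braces, `.` is simply the data frame that was piped in,
# so ordinary base-R functions work on it.
first_name <- mtcars %>% { names(.)[1] }
n_rows     <- mtcars %>% { nrow(.) }

first_name  # "mpg"
n_rows      # 32
```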

.data, on the other hand, is defined within rlang for the purpose of disambiguating symbol resolution. Along with .env, it allows you to be perfectly clear on where you want a particular symbol resolved (when ambiguity is expected). From ?.data, I think this is a clarifying contrast:

disp <- 10
mtcars %>% mutate(disp = .data$disp * .env$disp)
mtcars %>% mutate(disp = disp * disp)

However, as stated in the help pages, .data (and .env) is just a "pronoun" (we have verbs, so now we have pronouns too), so it is just a pointer to explain to the tidy internals where the symbol should be resolved. It's just a hint of sorts.

So your statement

both . and .data just mean "our result up to this point in the pipeline."

is not correct: . represents the data up to this point, .data is just a declarative hint to the internals.


Consider another way of thinking about .data: let's say we have two functions that completely disambiguate the environment a symbol is referenced against:

  • get_internally, this symbol must always reference a column name, it will not reach out to the enclosing environment if the column does not exist; and
  • get_externally, this symbol must always reference a variable/object in the enclosing environment, it will never match a column.

In that case, translating the above examples, one might use

disp <- 10
mtcars %>%
  mutate(disp = get_internally(disp) * get_externally(disp))

In that case, it seems more obvious that get_internally is not a frame, so you can't call names(get_internally) and expect it to do something meaningful (other than NULL). It'd be like names(mutate).

So don't think of .data as an object, think of it as a mechanism to disambiguate the environment of the symbol. I think the $ it uses is both terse/easy-to-use and absolutely-misleading: it is not a list-like or environment-like object, even if it is being treated as such.

BTW: one can write any S3 method for $ that makes any classed-object look like a frame/environment:

`$.quux` <- function(x, nm) paste0("hello, ", nm, "!")
obj <- structure(0, class = "quux")
obj$r2evans
# [1] "hello, r2evans!"
names(obj)
# NULL

(The presence of a $ accessor does not always mean the object is a frame/env.)

r2evans
  • Thank you, @r2evans. But, I'm still not 100% sure why the mutate(variable = names(.data)[1]) throws an error. Any insight on that? – Brad Cannell Aug 13 '20 at 16:45
  • Thank you again. I'm still not sure that I've completely absorbed this. But, I'm much closer than I was. – Brad Cannell Aug 13 '20 at 16:59
  • Don't get me wrong, I *want* `.data` to reference the current-data, similar to `data.table`'s `.SD` reference. But alas, it is not. – r2evans Aug 13 '20 at 16:59
  • "why names(.data) throws an error." That is because `.data` is not a data frame but an environment (see my answer). @r2evans Nice answer but I think it is missing the crucial point that `.data` always represents the _current_ data. In particular, it represents data for the current group, not the entire data frame, and it will also contain new columns created in previous expressions. – Lionel Henry Aug 14 '20 at 06:04
  • @LionelHenry, thank you for piping in! I recognize that you are `c("aut", "cre")` for `rlang`, there is no refuting your knowledge on the subject. However, `names(env)` and `length(env)` both work on environments but don't work on .data, so perhaps calling it an environment can be misinterpreted? – r2evans Aug 14 '20 at 06:29
  • More precisely, it has the data structure of an environment (a chain of environments actually) rather than a data frame. This is a pronoun with its own accessor methods which _represents_ an environment. – Lionel Henry Aug 17 '20 at 08:04
  • Very interesting. Next time I have free time (ugh), I'll look into the implementation. I understand what you're saying but I'm not sure (1) why a chain of environments is necessary to simulate the namespace of a single frame/environment, and (2) why it would not support `environment`-friendly functions. Not meaning to distract from this topic, I'll look and pipe up elsewhere if I really have questions about it. Thanks, @LionelHenry! – r2evans Aug 17 '20 at 16:34
  • Regarding (1), it's because the lower level contains special bindings for tidy eval. And in some more complicated cases (tidyselect) we store functions in an upper level and data in a lower level. Regarding (2) we did it on purpose because environments are by nature unordered, unlike data frames. Allowing `names()` would be confusing and cause programming errors. – Lionel Henry Aug 18 '20 at 07:46

The . variable comes from magrittr, and is related to pipes. It means "the value being piped into this expression". Normally with pipes, the value from a previous expression becomes argument 1 in the next expression, but this gives you a way to use it in some other argument.

The .data object is special to dplyr (though it is implemented in the rlang package). It does not have any useful value itself, but when evaluated in the dplyr "tidy eval" framework, it acts in many ways as though it is the value of the dataframe/tibble. You use it when there's ambiguity: if you have a variable with the same name foo as a dataframe column, then .data$foo says it is the column you want (and will give an error if it's not found, unlike data$foo which will give NULL). You could alternatively use .env$foo, to say to ignore the column and take the variable from the calling environment.

Both .data and .env are specific to dplyr functions and others using the same special evaluation scheme, whereas . is a regular variable and can be used in any function.
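
A small sketch of that disambiguation, assuming a variable cyl is defined in the calling environment purely for illustration (res and missing_col are likewise made-up names):

```r
library(dplyr)

cyl <- 100  # hypothetical variable that shares its name with a column

res <- mtcars %>%
  summarise(
    from_column = mean(.data$cyl),  # the cyl column of mtcars
    from_env    = .env$cyl          # the cyl variable defined above
  )
res

# Base R quietly returns NULL for a column that does not exist ...
is.null(mtcars$foo)

# ... whereas .data raises an error for one.
missing_col <- tryCatch(
  mtcars %>% mutate(y = .data$foo),
  error = function(e) "error"
)
missing_col
```

The tryCatch() wrapper is only there so the script keeps running; in interactive use you would simply see the error.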

Edited to add: You asked why names(.data) didn't work. If @r2evans's excellent answer isn't enough, here's a different take on it: I suspect the issue is that names() isn't a dplyr function, even though names.rlang_fake_data_pronoun is a method in rlang. So the expression names(.data) is evaluated using regular evaluation instead of tidy evaluation. The method has no idea what dataframe to look in, because in that context there isn't one.
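
To make the contrast concrete, here is a rough sketch that reruns the question's two pipelines side by side. The names names_data and names_dot are illustrative, and the exact error text may differ between rlang versions, so the error is only caught here, not matched:

```r
library(dplyr)

# names() refuses to work on the .data pronoun ...
names_data <- tryCatch(
  mtcars %>%
    count(.data[["cyl"]]) %>%
    mutate(variable = names(.data)[1]),
  error = function(e) "error"
)
names_data

# ... but works on the dot, which is the actual data frame piped in.
names_dot <- mtcars %>%
  count(.data[["cyl"]]) %>%
  mutate(variable = names(.)[1])
names_dot
```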

user2554330
  • Thank you, @user2554330. But, I'm still not 100% sure why the mutate(variable = names(.data)[1]) throws an error. Any insight on that? – Brad Cannell Aug 13 '20 at 16:52
  • See the addition. – user2554330 Aug 13 '20 at 19:08
  • Nice context. Your last paragraph suggests that if they implemented (say) `names.rlang_fake_data_pronoun`, they might be able to get *that* function to work. (It still doesn't address my *itch* to use `.data` like I do `.SD` ... :-) – r2evans Aug 13 '20 at 20:09
  • Actually there is a `names.rlang_fake_data_pronoun` method: it's what prints the error message. It can't do anything else, because it's being evaluated in the regular R evaluation scheme, not tidy eval. Tidy eval has access to lots of `dplyr`-specific information. – user2554330 Aug 13 '20 at 20:27

On a theoretical level:

. is the magrittr pronoun. It represents the entire input (often a data frame when used with dplyr) that is piped in with %>%.

.data is the tidy eval pronoun. Technically it is not a data frame at all, it is an evaluation environment.

On a practical level:

. will never be modified by dplyr. It remains constant until the next piped expression is reached. On the other hand, .data is always up to date. That means you can refer to previously created variables:

mtcars %>%
  mutate(
    cyl2 = cyl + 1,
    am3 = .data[["cyl2"]] + 10
  )

And you can also refer to column slices in the case of a grouped data frame:

mtcars %>%
  group_by(cyl) %>%
  mutate(cyl2 = .data[["cyl"]] + 1)

If you use .[["cyl"]] instead, the entire data frame will be subsetted and you will get an error because the input size is not the same as the group slice size. Tricky!
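
A sketch of the difference on a grouped data frame, using summarise() so both pronouns can be compared in one call (the column names n_dot and n_data are made up for illustration):

```r
library(dplyr)

sizes <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    n_dot  = length(.[["mpg"]]),     # the whole data frame that was piped in
    n_data = length(.data[["mpg"]])  # only the rows of the current group
  )
sizes
# n_dot is 32 in every row; n_data is 11, 7 and 14

# The same mismatch makes a grouped mutate() with the dot fail.
grouped_dot <- tryCatch(
  mtcars %>%
    group_by(cyl) %>%
    mutate(cyl2 = .[["cyl"]] + 1),
  error = function(e) "error"
)
grouped_dot
```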

Lionel Henry

Compare mtcars %>% count(.data[["cyl"]]) vs. mtcars %>% count(.[["cyl"]]).

mtcars %>% count(.[["cyl"]])
  .[["cyl"]]  n
1          4 11
2          6  7
3          8 14


mtcars %>% count(.data[["cyl"]])
  cyl  n
1   4 11
2   6  7
3   8 14

. is literally just the previous result. So the first is similar to:

. <- mtcars
count(., .[["cyl"]])

The second is a shorthand for looking up the variable by the string "cyl" and treating the previous result as the search path for the variable. For example, suppose you misspelled your variable name:

mtcars %>% count(.[["cyll"]])
   n
1 32

mtcars %>% count(.data[["cyll"]])
Error: Must group by variables found in `.data`.
* Column `cyll` is not found.

Using . will not throw an error because indexing to a non-existing column is a valid base-R operation that returns NULL.

Using .data throws an error for the same reason that referring to a non-existent variable directly does:

mtcars %>% count(cyll)

This call also throws an error.
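
The two lookups can be compared directly. This is only a sketch: silent and loud are illustrative names, and the error is captured with tryCatch() so the script does not stop:

```r
library(dplyr)

# The same base-R lookup that .[["cyll"]] performs: a missing column gives NULL.
silent <- mtcars[["cyll"]]
is.null(silent)  # TRUE

# The .data pronoun checks that the column exists and errors otherwise.
loud <- tryCatch(
  mtcars %>% count(.data[["cyll"]]),
  error = function(e) conditionMessage(e)
)
loud
```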

thc
  • Thank you, @thc. But, I'm still not 100% sure why the mutate(variable = names(.data)[1]) throws an error. Any insight on that? – Brad Cannell Aug 13 '20 at 16:50
  • @BradCannell Because `.data` is not an object and is never evaluated internally. It's simply a keyword (they call it pronoun) that provides a way of accessing columns using bracket `[[` or `$` notation. The key concept is "lazy evaluation". The arguments in a function in R are not evaluated until you use them, and there are many ways to see the arguments without evaluation. Internally, `.data` is never used as an object. It's simply seen as an argument that directs the function to do certain things. – thc Aug 13 '20 at 19:04