
Say I have

library(dplyr)
a <- list(a=tbl_df(cars), b=tbl_df(iris))

How can I add to each element of this list a column `name` whose values are the name of the corresponding list element? For instance, this is how the output should look for the first element:

Source: local data frame [50 x 3]

   speed  dist  name
   (dbl) (dbl) (chr)
1      4     2     a
2      4    10     a
3      7     4     a
4      7    22     a
5      8    16     a
6      9    10     a
7     10    18     a
8     10    26     a
9     10    34     a
10    11    17     a
Dambo
  • Please provide a reproducible dataset. – lmo May 18 '16 at 17:20
  • In the first code part of your question, add the line `data(cars)`. – lmo May 18 '16 at 17:25
  • Super quick in a `for` loop: `for (i in seq_along(a)) a[[i]]$name = names(a)[i]` – Gregor Thomas May 18 '16 at 17:32
  • @lmo `data(cars)` is unnecessary. The `datasets` package has had Lazy Loading of data for many years (most other R packages as well). – Gregor Thomas May 18 '16 at 17:36
  • @Gregor thanks, your solution works perfectly, but would you have any approach more consistent with the `dplyr` environment? I was thinking about calling a `mutate` on each tbl_df, preferably using a function rather than a loop (isn't that gonna be faster? maybe that's something you might wanna add to your response). – Dambo May 18 '16 at 17:40
  • `dplyr` performance shines when you're doing things by a large number of groups within a single data frame. You don't have a data frame, you have a list of data frames. `dplyr` doesn't work on lists, so you'll need to use `map` or `lapply` or something to operate on each data frame, and write a custom anonymous function to do so. And it probably won't be any faster, because what you're doing is so simple. – Gregor Thomas May 18 '16 at 17:47
  • See also [Is R's apply family more than syntactical sugar?](http://stackoverflow.com/q/2275896/903061) - the main reason to use an `apply` function rather than a loop should be readability. In this case it will actually be less readable, so you should just use the loop. – Gregor Thomas May 18 '16 at 17:48
  • @Gregor Ah, I got an error and thought it was the data, but it was actually the `tbl_df` function. – lmo May 18 '16 at 17:59

1 Answer


After all this commenting, guess I'll write an answer.

You should use a for loop for this: it's quick to code, quick to execute, readable and straightforward:

for (i in seq_along(a)) a[[i]]$name = names(a)[i]

You could use `map`, `mapply`, or `lapply` instead of the `for` loop; in this case, I think it would be less readable.
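If you do want the functional style, a minimal sketch using `purrr::imap` (assuming the purrr package is installed; purely illustrative) could look like:

library(purrr)
library(dplyr)

# imap() passes each list element (.x) along with its name (.y),
# so the name column can be added in a single mutate() call
a <- imap(a, ~ mutate(.x, name = .y))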

You could also use `mutate` instead of `[<-` for adding the column. This will be slower:

library(microbenchmark)
library(dplyr)
cars_tbl = tbl_df(cars)
mbm = microbenchmark  # short alias for the benchmarking function
mbm(
    mutate = {cars_tbl = mutate(cars_tbl, name = 'a')},
    base = {cars_tbl['name'] = 'a'}
)
# Unit: microseconds
#    expr     min       lq      mean  median       uq     max neval cld
#  mutate 240.617 262.4730 293.29001 276.158 299.7255 813.078   100   b
#    base  34.971  42.1935  55.46356  53.407  57.3980 226.932   100  a 

For such a simple operation, `[<-` is going to be hard to beat. `data.table` will probably be faster, but only if the object is already a `data.table`. If the object is a `data.frame` rather than a `tbl_df`, then the `mutate` is about twice as slow. But these differences are in microseconds. Unless you are repeatedly doing this operation to lists of at least hundreds of thousands of data frames, it won't matter.
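If your objects really were data.tables, the assignment mentioned above could be done by reference (a sketch, assuming the data.table package is installed):

library(data.table)

dt <- as.data.table(cars)
# := adds the column by reference, without copying the whole table
dt[, name := "a"]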

This is not to say dplyr has poor performance - when you are using the grouping operations, relying on the NSE built into dplyr, it's excellent. This is just a simple case where the base solution is easiest and also quickest.
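For instance, the kind of grouped operation where dplyr really pays off (purely illustrative, unrelated to the task above):

library(dplyr)
# a grouped summary over many groups is where dplyr's syntax and speed shine
mtcars %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg))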

If we increase the size of the data enough so that it takes a noticeable amount of time to do these operations (10 million rows, here), the differences essentially go away:

df = tbl_df(data.frame(x = rep(1, 1e7)))
mbm(
    mutate = {df = mutate(df, name = 'a')},
    base = {df['name'] = 'a'}
)
# Unit: milliseconds
#    expr      min       lq     mean    median       uq      max neval cld
#  mutate 58.08095 59.87531 132.3180 105.22507 207.6439 261.8121   100   a
#    base 52.09899 53.96386 129.9304  99.96153 203.8581 237.0084   100   a

Implementing with `for` loops and with `Map()`, comparing `[<-` and `mutate`:

# base for loop
for (i in seq_along(a)) {
    a[[i]]$name = names(a)[i]
}

# dplyr in for loop
for (i in seq_along(a)) {
    a[[i]] = mutate(a[[i]], name = names(a)[i])
}

# dplyr hiding the loop in Map()
a = Map(function(x, y) mutate(x, name = y), x = a, y = names(a)) 

We could benchmark these (I did -- see the edit history if you want the results), but the differences are less than 1 millisecond so it shouldn't matter. Go with whatever is easiest for you to read, write, and understand.
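If you do want to time them yourself, here is a sketch of how such a benchmark could be set up (results omitted; they depend on your machine and the size of the list):

library(microbenchmark)
microbenchmark(
    base_loop  = for (i in seq_along(a)) a[[i]]$name = names(a)[i],
    dplyr_loop = for (i in seq_along(a)) a[[i]] = mutate(a[[i]], name = names(a)[i]),
    dplyr_map  = Map(function(x, y) mutate(x, name = y), x = a, y = names(a))
)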

All this comes with the caveat that if your eventual goal is to bind these data frames together and you want the name column to see which list element the data came from, then that is implemented directly in `dplyr::bind_rows`.
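For instance (a minimal sketch of that approach):

# .id = "name" records which list element each row came from
combined <- bind_rows(a, .id = "name")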

Gregor Thomas
  • Thanks. Your answer still holds true if you start from two tbl_df() objects, but dplyr is twice as fast as in your example. I am just pointing that out since you mentioned the use of `data.table` conditional on prior transformation. – Dambo May 18 '16 at 18:16
  • Thanks, that's an excellent point. I re-ran the benchmark with `tbl_df` objects. – Gregor Thomas May 18 '16 at 18:19
  • I can't imagine anyone doing this operation more than once, and so the microsecond benchmark strikes me as largely meaningless. It's far more interesting to increase input data size such that a single run takes human-measurable time. – eddi May 18 '16 at 19:09
  • Strongly agreed. I wouldn't have benchmarked at all except for OP's comment after I first suggested a simple loop: *"I was thinking about calling a mutate on each tbl_df, preferably using a function rather than a loop (isn't that gonna be faster?..."* I should probably emphasize that the differences are trivial. – Gregor Thomas May 18 '16 at 19:15
  • @eddi Thanks for pointing that out; I was actually going to have about 20 tbl_df objects, which is why I was interested in how the script performed. I was simply curious to see the difference, because I struggle to understand when loops are preferable to vectorized alternatives, and it was somehow counterintuitive to me that a loop was the best solution in this case (even though we are speaking about trivial differences). – Dambo May 18 '16 at 20:32
  • Vectorized solutions are always fastest in R - and usually easiest to code as well. `apply` functions and the like aren't vectorized, per se. They're just nice wrappers so that you don't have to explicitly code the loop... most of the time. – Gregor Thomas May 18 '16 at 20:58
  • @Dambo Any vectorized solution, at the heart of it, is still just a simple loop. The difference is that normally for vectorized solutions the loop is happening in compiled internal code, vs doing an explicit loop in the interpreter (which involves going through multiple layers of grammar, interpreting, etc for each step of the loop). Obviously this difference will mainly matter for large loops, e.g. looping over rows of a large `data.frame`. In this case - (a) there isn't an internal loop that would do the job, (b) the loop is likely tiny (= number of tables in your list). – eddi May 18 '16 at 21:38
  • And just to clarify, when I say vectorized solution I mean something like `df$col = df$col + 5` vs `for (i in 1:nrow(df)) df$col[i] = df$col[i] + 5`. As far as `apply` vs `for` loops go, they're largely similar in performance, and there is a good Q&A about it here on SO if you search for it. – eddi May 18 '16 at 21:43