After all this commenting, guess I'll write an answer.
You should use a for loop for this: it's quick to code, quick to execute, readable and straightforward:
for (i in seq_along(a)) a[[i]]$name = names(a)[i]
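To make that concrete, here is a minimal sketch on a toy list. The list `a` and its element names (`first`, `second`) are made up for illustration; your real data will look different.

```r
# Hypothetical toy data: a named list of two small data frames
a = list(
  first  = data.frame(x = 1:2),
  second = data.frame(x = 3:5)
)

# Add a `name` column to each data frame, holding the name of its list element
for (i in seq_along(a)) a[[i]]$name = names(a)[i]

# Inspect one element to check the new column
str(a$second)
```

The length-1 name is recycled to fill every row of the data frame it is assigned to.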
You could use map, mapply, or lapply instead of the for loop, but in this case I think that would be less readable.
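For comparison, this is roughly what the loop-free version looks like in base R with Map(). It is only a sketch, again on a made-up toy list `a`; the helper's argument names (`df`, `nm`) are arbitrary.

```r
# Hypothetical toy data: a named list of two small data frames
a = list(first = data.frame(x = 1:2), second = data.frame(x = 3:5))

# Map() walks over the data frames and their names in parallel;
# the result keeps the names of the first list passed in
a = Map(function(df, nm) {
  df$name = nm  # add the column to this copy of the data frame
  df            # return the modified data frame
}, a, names(a))
```

You have to remember to return the modified data frame from the function, which is part of why the plain for loop reads more directly here.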
You could also use mutate instead of [<- to add the column, but this will be slower:
library(microbenchmark)
library(dplyr)
# tbl_df() is deprecated in current dplyr; as_tibble(cars) is the replacement
cars_tbl = tbl_df(cars)
mbm = microbenchmark
mbm(
mutate = {cars_tbl = mutate(cars_tbl, name = 'a')},
base = {cars_tbl['name'] = 'a'}
)
# Unit: microseconds
#    expr     min       lq      mean  median       uq     max neval cld
#  mutate 240.617 262.4730 293.29001 276.158 299.7255 813.078   100   b
#    base  34.971  42.1935  55.46356  53.407  57.3980 226.932   100  a
For such a simple operation, [<- is going to be hard to beat. data.table will probably be faster, but only if the object is already a data.table. If the object is a data.frame rather than a tbl_df, then mutate is about twice as slow. But these differences are in microseconds. Unless you are repeatedly doing this operation on lists of at least hundreds of thousands of data frames, it won't matter.
This is not to say dplyr has poor performance - when you are using the grouping operations, relying on the NSE built into dplyr, it's excellent. This is just a simple case where the simple base solution is easiest and also quickest.
If we increase the size of the data enough so that it takes a noticeable amount of time to do these operations (10 million rows, here), the differences essentially go away:
df = tbl_df(data.frame(x = rep(1, 1e7)))
mbm(
mutate = {df = mutate(df, name = 'a')},
base = {df['name'] = 'a'}
)
# Unit: milliseconds
#    expr      min       lq     mean    median       uq      max neval cld
#  mutate 58.08095 59.87531 132.3180 105.22507 207.6439 261.8121   100   a
#    base 52.09899 53.96386 129.9304  99.96153 203.8581 237.0084   100   a
Implementing with for loops and with Map(), comparing base assignment ($<-) and mutate:
# base for loop
for (i in seq_along(a)) {
a[[i]]$name = names(a)[i]
}
# dplyr in for loop
for (i in seq_along(a)) {
a[[i]] = mutate(a[[i]], name = names(a)[i])
}
# dplyr hiding the loop in Map()
a = Map(function(x, y) mutate(x, name = y), x = a, y = names(a))
We could benchmark these (I did -- see the edit history if you want the results), but the differences are less than 1 millisecond so it shouldn't matter. Go with whatever is easiest for you to read, write, and understand.
All this comes with the caveat that if your eventual goal is to bind these data frames together and you want the name column to see what list element the data came from, then that is implemented directly in dplyr::bind_rows, via its .id argument.
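As a sketch of that last point, assuming a named list like the toy `a` below (made up for illustration), bind_rows can add the identifying column itself while stacking the data frames:

```r
library(dplyr)

# Hypothetical toy data: a named list of two small data frames
a = list(first = data.frame(x = 1:2), second = data.frame(x = 3:5))

# .id gives the name of a new column that records which list element
# each row came from, taken from the names of the list
combined = bind_rows(a, .id = "name")

combined
```

In that case there is no need to add the column to each data frame beforehand.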