Row names are stored lazily in a data frame but when copied they are fully evaluated
I have extensively updated this answer because I realised I had reached an incorrect conclusion by using tracemem() to track the memory location of objects without considering their size. Instead, I use a helper function to create a simplified tree representation of the output of lobstr::sxp(dat), which shows how objects are represented in memory. The conclusion is that it is unnecessary to pre-allocate row.names, and that simply avoiding copying them, as described in the excellent answer by Joris C., is sufficient.
tl;dr: never copy the row names attribute
When you create a data frame of n rows without explicitly declaring the row names, the row names are stored as an integer vector of length 2 of the form c(NA, -n).
If you copy the row names attribute from one data frame to another, R fully evaluates (expands) this compact vector in order to copy it. This should never be done.
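You can see this compact form directly from base R (a small illustration of my own, not part of the original benchmarks): .row_names_info() reports a negative row count when the compact automatic form is in use, and .set_row_names() builds that form.
d <- data.frame(x = 1:5)
.row_names_info(d)   # -5: five rows, stored as compact "automatic" row names
.set_row_names(5L)   # NA -5: the internal length-2 representation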
Alternatively you could use data.table or the tidyverse, both of which keep attributes when a copy is made, avoiding the need to copy anything.
A closer look at what happens in memory
Let's create a data frame with 10 rows.
num_rows <- 10
set.seed(0)
dat <- data.frame(
x_char = sample(letters, num_rows),
x_int = sample(1:10, num_rows)
)
Let's look at how it appears in memory:
library(lobstr)
dat_sxp <- sxp(dat)
get_dat_obj_tree(dat_sxp)
1 dat VECSXP length: 2 mem_addr:0x7
2 ¦--x_char STRSXP length: 10 mem_addr:0x1
3 ¦--x_int INTSXP length: 10 mem_addr:0x2
4 °--_attrib LISTSXP length: 3 mem_addr:0x3
5 ¦--names STRSXP length: 2 mem_addr:0x4
6 ¦--class STRSXP length: 1 mem_addr:0x5
7 °--row.names INTSXP length: 2 mem_addr:0x6
The function replaces the memory addresses with unique integers (e.g. mem_addr:0x1 will remain the address of x_char every time the real address is looked up, unless the memory location of x_char actually changes).
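For reference, here is a minimal sketch of how such a helper could work (my own reconstruction, not the function that produced the trees shown here, which prints a prettier tree layout): capture the printed output of lobstr::sxp(), then replace each raw hex address with a small integer that is kept stable across calls via a cache environment (addr_map is a name I made up).
addr_map <- new.env()  # persistent address -> small integer mapping

get_dat_obj_tree <- function(obj_sxp, name = "dat") {
  # Capture the printed sxp() tree and stabilise the memory addresses
  out <- utils::capture.output(print(obj_sxp))
  addrs <- unique(unlist(regmatches(out, gregexpr("0x[0-9a-fA-F]+", out))))
  for (a in addrs) {
    if (!exists(a, envir = addr_map, inherits = FALSE)) {
      assign(a, paste0("0x", length(ls(addr_map)) + 1L), envir = addr_map)
    }
    out <- gsub(a, get(a, envir = addr_map), out, fixed = TRUE)
  }
  cat(name, "\n")
  cat(out, sep = "\n")
}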
We would expect the data to have length 10. But why are the row.names only length 2? Let's print them:
rownames(dat) # "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
attr(dat, "row.names") # 1 2 3 4 5 6 7 8 9 10
Clearly these are vectors with length 10. You might notice that one is a character vector and one is an integer vector. This led me down a lot of dead-ends, until I found this comment in the R source code:
## As from R 2.4.0, row.names can be either character or integer.
## row.names() will always return character.
## attr(, "row.names") will return either character or integer.
##
## Do not assume that the internal representation is either, since
## 1L:n is stored as the integer vector c(NA, n) to save space (and
## the C-level code to get/set the attribute makes the appropriate
## translations.
This reminded me of something you often see in reproducible examples:
dput(dat)
# structure(list(x_char = c("e", "i", "n", "z", "w", "b", "j",
# "l", "o", "a"), x_int = c(4L, 3L, 6L, 2L, 7L, 10L, 5L, 8L, 9L,
# 1L)), class = "data.frame", row.names = c(NA, -10L))
We see that row names are indeed represented as a vector of length 2, row.names = c(NA, -10L). This is the key to understanding how to avoid the expensive copy operation.
How does creating a new attribute change things?
It doesn't. It simply creates a circumstance in which you are more likely to copy the row names yourself, because custom attributes are not preserved by every operation. R Internals states:
Subsetting (other than by an empty index) generally drops all attributes except names, dim and dimnames which are reset as appropriate.
Let's create a new attribute, foo, and see what happens in memory:
attr(dat, "foo") <- TRUE
Let's look at the internal representation:
dat_foo_sxp <- sxp(dat)
get_dat_obj_tree(dat_foo_sxp)
1 dat VECSXP length: 2 mem_addr:0x7
2 ¦--x_char STRSXP length: 10 mem_addr:0x1
3 ¦--x_int INTSXP length: 10 mem_addr:0x2
4 °--_attrib LISTSXP length: 4 mem_addr:0x3
5 ¦--names STRSXP length: 2 mem_addr:0x4
6 ¦--class STRSXP length: 1 mem_addr:0x5
7 ¦--row.names INTSXP length: 2 mem_addr:0x6
8 °--foo LGLSXP length: 1 mem_addr:0x8
Nothing has truly changed in memory - the attributes pairlist simply has a new node of type LGLSXP, i.e. a logical vector.
What happens when we subset the data frame?
Let's re-order the columns.
new <- dat[, c(2,1)]
Although we have selected all the columns, we are essentially subsetting the data by index. Let's look at the nodes of the object in memory:
new_sxp <- sxp(new)
get_dat_obj_tree(new_sxp, "new")
1 new VECSXP length: 2 mem_addr:0x12
2 ¦--x_int INTSXP length: 10 mem_addr:0x2
3 ¦--x_char STRSXP length: 10 mem_addr:0x1
4 °--_attrib LISTSXP length: 3 mem_addr:0x9
5 ¦--names STRSXP length: 2 mem_addr:0x10
6 ¦--class STRSXP length: 1 mem_addr:0x5
7 °--row.names INTSXP length: 2 mem_addr:0x11
This is broadly what we would expect from a lazily-evaluated copy, apart from the row.names, which contain the same values but have a new memory address:
- The data frame itself has a new memory address.
- The memory address of the integer column is the same.
- The memory address of the character column is the same.
- The attributes pairlist has a new memory address.
- The names have a new location (because they are re-ordered).
- The class has the same address.
- The row.names have a new memory address.
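Note also that the foo attribute has been dropped by the subset, exactly as the quote above predicts:
attr(new, "foo") # NULL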
Perhaps R could have kept the row.names in the same memory location. After all, we are only subsetting columns, so the number and order of rows is unchanged.
However, and this is why my previous suggestion to pre-allocate the row names was wrong, the fact that there are new row.names does not significantly affect execution time. R is creating a new integer vector of length 2, regardless of the size of the data. This takes almost no time. It is probably not worth adding logic to the R source to establish whether the rows are the same, in order to avoid such a tiny operation.
So why does the example in the question take longer with larger data frames?
It is notable in your example, and in the answer by Joris C., that operations take longer if they include attr(new, "row.names") <- attr(dat, "row.names"), either individually or as part of a larger function call such as utils::modifyList(attributes(dat), attributes(new)). Let's try the simple way:
attr(new, "row.names") <- attr(dat, "row.names")
get_dat_obj_tree(sxp(new))
1 dat VECSXP length: 2 mem_addr:0x15
2 ¦--x_int INTSXP length: 10 mem_addr:0x2
3 ¦--x_char STRSXP length: 10 mem_addr:0x1
4 °--_attrib LISTSXP length: 3 mem_addr:0x13
5 ¦--names STRSXP length: 2 mem_addr:0x10
6 ¦--class STRSXP length: 1 mem_addr:0x5
7 °--row.names INTSXP length: 2 mem_addr:0x14
There's a new memory address. But the row.names attribute of new is still an integer vector of length 2. If we run dput(new) we will see row.names = c(NA, -10L).
So if we are copying an integer vector of length 2 from one place to another, regardless of the size of the data, why is it taking longer with larger data frames? The answer lies in what happens when you run:
attr(new, "row.names") <- attr(dat, "row.names")
This is syntactic sugar for:
new <- `attr<-`(new, "row.names", attr(dat, "row.names"))
Firstly, this means that we are evaluating the row.names for dat. Secondly, as R Internals notes with a similar example, a <- `dim<-`(a, c(7, 2)):
in principle two copies of a exist for the duration of the computation
So this may be happening twice.
Where is the evaluation happening?
An easier way to understand this is by printing the right-hand side of that function call.
`attr<-`(new, "row.names", attr(dat, "row.names"))
# <truncated>
# attr(,"row.names")
# [1] 1 2 3 4 5 6 7 8 9 10
By the time the row.names are stored in new, the R source code in attrib.c is clever enough to restore them to the compact c(NA, n) form:
INTEGER(val)[0] = NA_INTEGER;
INTEGER(val)[1] = n; // +n: compacted *and* automatic row names
However, the damage is done: the short-form c(NA, -10) row names were fully evaluated, which, as you would expect (and have demonstrated), takes more time for longer vectors of row names.
Solutions
It is possible to avoid this issue in base R, and also with the data.table and tidyverse packages.
base R solution
The main point is - do not copy the row names from one data frame to another. The function suggested by Joris C., which copies only the attributes that were not carried over by the subset operation rather than copying all attributes, is a good base R solution.
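As a rough sketch of that idea (my own illustration, not Joris C.'s exact function; copy_missing_attrs is a made-up name): copy only the attributes that the subset dropped, leaving the names, class and compact row.names that `[` already set untouched.
copy_missing_attrs <- function(new, old) {
  # Attributes present on the original but dropped by the subset
  dropped <- setdiff(names(attributes(old)), names(attributes(new)))
  for (a in dropped) attr(new, a) <- attr(old, a)
  new
}

new <- dat[, c(2, 1)]
new <- copy_missing_attrs(new, dat)
attr(new, "foo") # TRUE, restored without touching row.names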
data.table solution
An alternative is to convert the data frame to a data.table and use data.table::setattr() to set attributes by reference:
library(data.table)
orig <- data.frame(x1 = 1, x2 = 2)
setDT(orig)
mem_location <- tracemem(orig)
setattr(orig, "foo", TRUE)
tracemem(orig) == mem_location # TRUE
attr(orig, "foo") # TRUE
Additionally, with data.table you can change the column order by reference, so you do not lose the attributes when you reorder the columns:
setcolorder(orig, c(2,1))
attr(orig, "foo") # TRUE
orig
# x2 x1
# 1: 2 1
tidyverse solution
Similarly, a tibble() keeps its attributes, including the compact row.names, when you subset columns:
library(tibble)
set.seed(0)
num_rows <- 10
dat <- tibble(
x_char = sample(letters, num_rows),
x_int = sample(1:10, num_rows)
)
attr(dat, "foo") <- TRUE
new <- dat[,c(2,1)]
attr(new, "foo") # TRUE
I went down several dead-ends with this one, and posted two answers that were not quite right before I understood what was really happening under the hood. But I learned a lot about R in the process. Thanks for asking such an interesting question.