Object size for characters in R - How does R global string pool work?

Question

I am reading Hadley's Advanced R Programming and when it discusses the memory size for characters it says this:

R has a global string pool. This means that each unique string is only stored in one place, and therefore character vectors take up less memory than you might expect.

The example the book gives is this:

library(pryr)
object_size("banana")
#> 96 B
object_size(rep("banana", 10))
#> 216 B

One of the exercises in this section is to compare these two character vectors:

vec <- lapply(0:50, function(i) c("ba", rep("na", i)))
str <- lapply(vec, paste0, collapse = "")

object_size(vec)
#> 13.4 kB

object_size(str)
#> 8.74 kB

Now, since the passage states that R has a global string pool, and since vec consists mainly of repetitions of two strings ("ba" and "na"), I would intuitively expect vec to be smaller than str.

So my question is: how could you most accurately estimate the size of those vectors beforehand?

This is just thinking out loud, but I bet this depends on the size of the string pool prior to instantiating the vector. Have you done any experiments testing the interaction between the length of the vector, the (cumulative) lengths of the strings in that vector, and whether or not some or all of the strings are already in the string pool (i.e. x <- 'foo', y = c('foo','bar')) etc.? Also, this might be platform dependent, as I get totally different sizes for the objects: for me `object_size(vec)` yields `7.42 kB` and `object_size(str)` yields `6.89 kB`. - Jthorpe, Apr 17 '15 at 16:53

score 3 · Answer 1 · edited Jun 20 '20 at 09:12

The key difference comes from the pointers in vec: each of the short scalar strings (CHARSXPs) has to be pointed to from the corresponding string vector (STRSXP). There are 1326 such string pointers inside vec, but only 51 in str (a pointer is probably 8 bytes on your platform). The pool is for scalar strings (it is also known as the CHARSXP cache). Another non-obvious factor is internal fragmentation: on my system, a scalar string takes the same amount of memory whether it has 0 or 7 characters, the size only goes up at 8 characters, and so on. See the repeated sizes in the following:

unlist(sapply(str, object.size))

 [1] 96 96 96 104 104 104 104 120 120 120 120 120 120 120 120 136 136 136 136
[20] 136 136 136 136 152 152 152 152 152 152 152 152 216 216 216 216 216 216 216
[39] 216 216 216 216 216 216 216 216 216 216 216 216 216

These are, however, implementation details of R's memory manager that could change, and user programs should not depend on them in any way: with another object layout or memory manager, str could use more space than vec.
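
The pointer counts quoted above can be checked directly. The following is only a rough sketch: the variable names (n_ptr_vec, extra_ptr_bytes and so on) are made up for illustration, and it counts pointers only, ignoring vector headers and allocator rounding, so it is not a full size estimate.

```r
# Rebuild the two lists from the question.
vec <- lapply(0:50, function(i) c("ba", rep("na", i)))
str <- lapply(vec, paste0, collapse = "")

# Every element of a character vector is one pointer to a pooled CHARSXP.
n_ptr_vec <- sum(lengths(vec))  # 1 + 2 + ... + 51 elements
n_ptr_str <- sum(lengths(str))  # one element per list entry

# Number of distinct strings each list refers to.
n_uniq_vec <- length(unique(unlist(vec)))  # only "ba" and "na"
n_uniq_str <- length(unique(unlist(str)))  # every pasted string is different

# Pointer size as reported by this R build (8 on a typical 64-bit platform).
ptr_bytes <- .Machine$sizeof.pointer

# Extra bytes vec spends on pointers compared with str.
extra_ptr_bytes <- (n_ptr_vec - n_ptr_str) * ptr_bytes

print(c(n_ptr_vec = n_ptr_vec, n_ptr_str = n_ptr_str,
        n_uniq_vec = n_uniq_vec, n_uniq_str = n_uniq_str,
        extra_ptr_bytes = extra_ptr_bytes))
```

So vec refers to just two pooled strings but needs 1326 pointers to do it, while str needs 51 pointers to 51 distinct (and mostly longer) strings. How those two costs balance out depends on the pointer size and on how the allocator rounds string sizes on your build.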
