Problem. When I work with large datasets (millions of rows) in R, I use the data.table package. Recently, I have had to work with string identifiers (such as "AEREOBCOIRE045451O34") that have low cardinality, in the sense that the ratio length(unique(x)) / length(x) is small. Which type is more appropriate for storing such identifiers: character or factor?
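To make that ratio concrete, here is a toy illustration (the values are made up and only meant to mirror the shape of the real data):

x <- sample(sprintf("id_%03d", 1:100), 1e6, replace = TRUE)  # 100 distinct values over 1e6 rows
length(unique(x)) / length(x)                                # 1e-04: very low cardinality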
In this answer, Matt Dowle explains that operations in data.table have been optimized for character. After reading it, my take-away was that I should always use character identifiers. However:
- This comment by @MichaelChirico suggests that this reasoning is outdated.
- There can be a significant memory gain in using factors (about 20% in the reproducible example below).
Question. Given the reproducible example below, there are significant memory gains from switching to factor-type identifiers. Is there a trade-off between memory and speed here? Specifically, since Matt Dowle explains that some operations are optimized for character, what would be the cost of using factors instead? (A sketch of the kind of timing comparison I have in mind follows the memory example below.)
Additional context. The issue of characters vs. factors has been discussed a lot on Stack Overflow (see for example here (1), here (2) or here (3); there are many others). The advice provided has evolved quite a bit over time, and, as of today, it's not clear from reading previous answers what the best practice is.
Reproducible example for memory usage.
library(data.table)
library(pryr)
set.seed(1234)
N <- 1e7
# 100 distinct 40-character identifiers, recycled over 1e7 rows (low cardinality)
vec_id <- stringi::stri_rand_strings(100, 40)
id_lowcard <- sample(vec_id, size = N, replace = TRUE)
v1 <- runif(N)
v2 <- rnorm(N)
# A stores the identifier as character, B as factor
A <- data.table(id_lowcard, v1, v2)
B <- data.table(id_lowcard = as.factor(id_lowcard), v1, v2)
# relative size difference: how much bigger the character version is than the factor one
cat("Memory gain: ", round((object_size(A) / object_size(B) - 1) * 100), "%")