Problem. When I work with large datasets (millions of rows) in R, I use the data.table package. Recently, I have had to work with string identifiers (such as "AEREOBCOIRE045451O34") that have low cardinality, in the sense that the ratio length(unique(x)) / length(x) is small. Which type is more appropriate for storing such identifiers: character or factor?
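To make that ratio concrete, here is a toy illustration (the values are made up and only meant to mirror the shape of the real data):

x <- sample(sprintf("id_%03d", 1:100), 1e6, replace = TRUE)  # 100 distinct values over 1e6 rows
length(unique(x)) / length(x)                                # 1e-04: very low cardinality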
In this answer, Matt Dowle explains that operations in data.table have been optimized for character. After reading it, my take-away was that I should always use character identifiers. However:
- This comment by @MichaelChirico suggests that this reasoning is outdated.
- There can be a significant memory gain in using factors (about 20% in the reproducible example below).
Question. Given the reproducible example below, there are significant memory gains from switching to factor-type identifiers. Is there a trade-off between memory and speed here? Specifically, since Matt Dowle explains that some operations are optimized for character, what would be the cost of using factors instead? (A sketch of the kind of timing comparison I have in mind follows the memory example below.)
Additional context. The issue of characters vs. factors has been discussed a lot on Stack Overflow (see for example here (1), here (2) or here (3); there are many others). The advice provided has evolved quite a bit over time, and, as of today, it's not clear from reading previous answers what the best practice is.
Reproducible example for memory usage.
library(data.table)
library(pryr)
set.seed(1234)
N <- 1e7
# 100 distinct 40-character identifiers, recycled over 1e7 rows (low cardinality)
vec_id <- stringi::stri_rand_strings(100, 40)
id_lowcard <- sample(vec_id, size = N, replace = TRUE)
v1 <- runif(N)
v2 <- rnorm(N)
# A stores the identifier as character, B as factor
A <- data.table(id_lowcard, v1, v2)
B <- data.table(id_lowcard = as.factor(id_lowcard), v1, v2)
# relative size difference: how much bigger the character version is than the factor one
cat("Memory gain: ", round((object_size(A) / object_size(B) - 1) * 100), "%")