TL;DR: What's the fastest and most memory-efficient way to convert all factor columns in a data.table to character columns?
To preface this, I suspect this question may already be answered elsewhere, but I'm struggling to find anything definitive, so if anyone can point me in the right direction I'd appreciate that just as much as an exhaustive answer.
I frequently need to convert all factor columns in a data.table to character columns. Having seen Matt Dowle suggest using set for this very task, as in the comment on this answer, I had assumed that was the fastest and most memory-efficient approach. However, when I did some testing, I was surprised by the results. My methodology was as follows:
Test Data Generation
library(data.table)
set.seed(1234)
## Create a vector of 1 million 4-character strings
## with 456,976 (26^4) possible unique values
DiverseSize <- 1e6
Diverse <- paste0(sample(LETTERS, DiverseSize, replace = TRUE),
                  sample(letters, DiverseSize, replace = TRUE),
                  sample(letters, DiverseSize, replace = TRUE),
                  sample(letters, DiverseSize, replace = TRUE))

## Create a vector of 10 million single-character strings
## with 26 possible unique values
CommonSize <- 1e7
Common <- sample(LETTERS, CommonSize, replace = TRUE)
## Mix them into data.table columns "x0" through "x9"
DT_Original <- data.table(x0 = as.factor(sample(c(Diverse, Common), size = CommonSize + DiverseSize, replace = FALSE)),
                          x1 = as.factor(sample(c(Diverse, Common), size = CommonSize + DiverseSize, replace = FALSE)),
                          x2 = as.factor(sample(c(Diverse, Common), size = CommonSize + DiverseSize, replace = FALSE)),
                          x3 = as.factor(sample(c(Diverse, Common), size = CommonSize + DiverseSize, replace = FALSE)),
                          x4 = as.factor(sample(c(Diverse, Common), size = CommonSize + DiverseSize, replace = FALSE)),
                          x5 = as.factor(sample(c(Diverse, Common), size = CommonSize + DiverseSize, replace = FALSE)),
                          x6 = as.factor(sample(c(Diverse, Common), size = CommonSize + DiverseSize, replace = FALSE)),
                          x7 = as.factor(sample(c(Diverse, Common), size = CommonSize + DiverseSize, replace = FALSE)),
                          x8 = as.factor(sample(c(Diverse, Common), size = CommonSize + DiverseSize, replace = FALSE)),
                          x9 = as.factor(sample(c(Diverse, Common), size = CommonSize + DiverseSize, replace = FALSE)))
DT1 <- copy(DT_Original)
DT2 <- copy(DT_Original)
DT3 <- copy(DT_Original)
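As a quick, optional sanity check (not something I profiled), I confirmed the structure of the generated table before running anything:

## All ten columns should be factors; size is approximate
sapply(DT_Original, is.factor)
format(object.size(DT_Original), units = "MB")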
Functions
unfactorize <- function(df){
  ## Base R approach: replace each factor column via [[<-
  for (i in which(sapply(df, class) == "factor"))
    df[[i]] <- as.character(df[[i]])
  return(df)
}

set_unfactorize <- function(df){
  ## data.table approach: overwrite each factor column by reference
  for (col in names(df)[which(sapply(df, class) == "factor")])
    set(df, j = col, value = as.character(df[[col]]))
}
Execution
I then profiled the following three statements using RStudio's built-in profiling tools.
## Original
DT1 <- unfactorize(DT1)
## data.table::set version
set_unfactorize(DT2)
## Outside of a function
for (col in names(DT3)[which(sapply(DT3, class) == "factor")])
  set(DT3, j = col, value = as.character(DT3[[col]]))
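For anyone who prefers wall-clock numbers to profiler output: note that repeated-run benchmarking tools would need a fresh copy per iteration here, since after the first conversion there are no factor columns left to convert. A crude single-run sketch with system.time() (DT_a and DT_b are throwaway names):

## Coarse timing; each version gets its own fresh copy
DT_a <- copy(DT_Original)
DT_b <- copy(DT_Original)
system.time(DT_a <- unfactorize(DT_a))
system.time(set_unfactorize(DT_b))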
Profiling Results
I was very surprised by the outcome: the base R version seems to be the fastest and to use the least memory, even though I would have expected it to require a copy. Is this right, or am I missing something here?
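In case it helps anyone answering: one crude way I know of to check whether the base R version really allocates a new table is to compare addresses before and after (address() is exported by data.table; DT_chk is a throwaway name). I haven't dug into what [[<- does internally, so treat this as a sketch:

## A changed address means a new table object was allocated,
## though it says nothing about whether the columns were copied too
DT_chk <- copy(DT_Original)
address(DT_chk)               # address before conversion
DT_chk <- unfactorize(DT_chk)
address(DT_chk)               # address after conversion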