1

TL;DR: What's the fastest/most memory efficient way to convert all factor columns in a data.table to character columns?

To preface this, I have a suspicion that this question may be answered elsewhere I'm struggling to find anything definitive, so if anyone can point me in the right direction I'd appreciate that equally as much as an exhaustive answer.

I frequently need to convert all data.table factor columns to character columns. Having seen Matt Dowle suggest using set for this very task like in the comment on this answer, I had assumed that that was the fastest/most efficient manner.

However, when I did some testing, I was surprised by the results. Methodology is as follows:

Test Data Generation

library(data.table)
set.seed(1234)

## Create a vector of 1 million 4 character strings
## with 456,976 possible unique values 
DiverseSize <- 1e6
Diverse <- paste0(sample(LETTERS,DiverseSize,replace = TRUE),
                  sample(letters,DiverseSize,replace = TRUE),
                  sample(letters,DiverseSize,replace = TRUE),
                  sample(letters,DiverseSize,replace = TRUE))

## Create a vector of 10 million single character strings
## with 26 possible unique values
CommonSize  <- 1e7
Common <-  sample(LETTERS,CommonSize,replace = TRUE)

## Mix them into a data.table columns, "x0" through "x9"
DT_Original<- data.table(x0 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)),
                         x1 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)),
                         x2 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)),
                         x3 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)),
                         x4 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)),
                         x5 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)),
                         x6 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)),
                         x7 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)),
                         x8 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)),
                         x9 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)))

DT1 <- copy(DT_Original)
DT2 <- copy(DT_Original)
DT3 <- copy(DT_Original)

Functions

unfactorize <- function(df){
  for(i in which(sapply(df, class) == "factor")) df[[i]] = as.character(df[[i]])
  return(df)
}

set_unfactorize <- function(df){
  for(col in names(df)[which(sapply(df, class) == "factor")]) set(df, j = col, value = as.character(df[[col]]))
}

Execution

I then profiled the following three statements using RStudio's built in profiling tools.

## Original
DT1 <- unfactorize(DT1)
## data.table::set version
set_unfactorize(DT2)
## Outside of function
for(col in names(DT3)[which(sapply(DT3, class) == "factor")]) set(DT3, j = col, value = as.character(DT3[[col]]))

Profiling Results

I was very surprised by the outcome-- the base R version seems to perform the fastest and use the least memory even though I would expect it to require a copy. Is this right, or am I missing something here?

Profiling Comparison

Matt Summersgill
  • 4,054
  • 18
  • 47
  • 4
    data.table columns should never be factors unless you explicitly set them yourself, so the fastest way would be to just remove the code that converts them to a factor – Scott Ritchie Aug 15 '18 at 17:55
  • 2
    In many cases, what Scott said means using stringsAsFactors = FALSE when reading data in. – Frank Aug 15 '18 at 17:56
  • Using R Services on SQL Server to apply R functions on tables in database, the data.frame object passed from the SQL to R runtime environment treats character columns as factors by default. I'm looking for a global option to change that behavior, but I'm also interested in this from a purely academic standpoint since what I found conflicted with my expectations. – Matt Summersgill Aug 15 '18 at 18:01
  • Related: [*`Convert *some* column classes in data.table`*](https://stackoverflow.com/q/32940580/2204410) – Jaap Aug 15 '18 at 18:03

0 Answers0