I want to convert 12 variables in a dataset with 10+ million rows, so I'm looking for something fast.
Sample dataset
library(data.table)
set.seed(1)
(DT <- data.table(
  V1 = sample(LETTERS[1:3], size = 10, replace = TRUE),
  V2 = sample(letters[5:9], size = 10, replace = TRUE)))
str(DT)
Classes ‘data.table’ and 'data.frame': 10 obs. of 2 variables:
$ V1: chr "A" "C" "A" "B" ...
$ V2: chr "g" "e" "i" "i" ...
- attr(*, ".internal.selfref")=<externalptr>
I want to convert V1 and V2 (12 variables in the real dataset, with 10+ million rows) to ordered factors. This works, but I know I shouldn't...
DT[, V1 := factor(V1, levels = sort(unique(V1)), ordered = TRUE)]
DT[, V2 := factor(V2, levels = sort(unique(V2)), ordered = TRUE)]
str(DT)
Classes ‘data.table’ and 'data.frame': 10 obs. of 2 variables:
$ V1: Ord.factor w/ 3 levels "A"<"B"<"C": 1 3 1 2 1 3 3 2 2 3
$ V2: Ord.factor w/ 4 levels "e"<"f"<"g"<"i": 3 1 4 4 2 2 1 4 4 1
- attr(*, ".internal.selfref")=<externalptr>
From this post I know that DT[, .N, by = V1][order(-N), V1] is faster than sort(unique(x)). So I said "piece of cake"...
cols <- c("V1", "V2")
for (i in cols) {
  # faster than sort(unique(x))
  levs <- DT[, .N, by = mget(i)][order(-N), mget(i)][[1]]
  # all variables in cols, to ordered factor
  DT[, mget(i) := factor(mget(i), levels = levs, ordered = TRUE)]
}
... and it doesn't work. I've also tried building the sorted unique levels with lapply, levs <- lapply(DT[, ..cols], function(x) sort(unique(x))), but I can't assign the levels to each variable
=(
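For reference, here is a minimal sketch of one way the loop could be written, assuming alphabetical levels are what is wanted. The attempt above most likely fails because mget(i) returns a list, while the left-hand side of := expects a column name (or a parenthesised character vector such as (i)). The names col, x, levs and DT2 are illustrative only. Note also that DT[, .N, by = V1][order(-N), V1] orders the levels by descending frequency, not alphabetically, so it is not a drop-in replacement for sort(unique(x)).

```r
library(data.table)

set.seed(1)
DT <- data.table(
  V1 = sample(LETTERS[1:3], size = 10, replace = TRUE),
  V2 = sample(letters[5:9], size = 10, replace = TRUE)
)
cols <- c("V1", "V2")

# Loop over the column names and update each column by reference with set().
for (col in cols) {
  x <- DT[[col]]
  # Alphabetical levels, as in the one-column-at-a-time version.
  levs <- sort(unique(x))
  # Frequency-ordered alternative (a different ordering, most common first):
  # levs <- DT[, .N, by = c(col)][order(-N)][[col]]
  set(DT, j = col, value = factor(x, levels = levs, ordered = TRUE))
}
str(DT)

# The same idea without an explicit loop, on a small hypothetical table:
# a parenthesised LHS plus .SDcols assigns all listed columns at once.
DT2 <- data.table(V1 = c("b", "a", "b"), V2 = c("z", "y", "y"))
DT2[, (cols) := lapply(.SD, function(x) {
  factor(x, levels = sort(unique(x)), ordered = TRUE)
}), .SDcols = cols]
str(DT2)
```

Both variants modify the table in place, so no copy of the 10+ million rows is made beyond the new factor columns themselves. Which one is faster on the real data is something to benchmark rather than assume.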