
I want to convert 12 variables with 10+ million rows, so I'm looking for something fast.

Sample dataset

library(data.table)
set.seed(1)
(DT <- data.table(
  V1 = sample(LETTERS[1:3], size=10, replace=TRUE),
  V2 = sample(letters[5:9], size=10, replace=TRUE)))

str(DT)
Classes ‘data.table’ and 'data.frame':  10 obs. of  2 variables:
 $ V1: chr  "A" "C" "A" "B" ...
 $ V2: chr  "g" "e" "i" "i" ...
 - attr(*, ".internal.selfref")=<externalptr> 

I want to change V1 and V2 (12 variables in the real dataset, with 10+ million rows) to ordered factors. This works, but I know I shouldn't repeat it for every column...

DT[, V1 := factor(V1, levels = sort(unique(V1)), ordered = TRUE)]
DT[, V2 := factor(V2, levels = sort(unique(V2)), ordered = TRUE)]

str(DT)
Classes ‘data.table’ and 'data.frame':  10 obs. of  2 variables:
 $ V1: Ord.factor w/ 3 levels "A"<"B"<"C": 1 3 1 2 1 3 3 2 2 3
 $ V2: Ord.factor w/ 4 levels "e"<"f"<"g"<"i": 3 1 4 4 2 2 1 4 4 1
 - attr(*, ".internal.selfref")=<externalptr> 

From this post I know that DT[, .N, by = V1][order(-N), V1] is faster than sort(unique(x)). So I said "piece of cake"...

cols <- c("V1", "V2")
for (i in cols) {
  # faster than sort(unique(x))
  levs <- DT[, .N, by = mget(i)][order(-N), mget(i)][[1]]

  # all variables in cols, to ordered factor
  DT[, mget(i) := factor(mget(i), levels = levs, ordered = TRUE)]
  }

... and it doesn't work. I've also tried building the unique sorted levels with lapply, levs <- lapply(DT[, ..cols], function(x) { sort(unique(x)) }), but I can't assign the levels to each variable.

=(

504aldo
  • FYI, sort(unique(x)) will put them in alphanumeric order, while Uwe's way (from your link) sorts by decreasing frequency. You can compare your own results above ("e"<"f"<"g"<"i") vs below ("g"<"e"<"i"<"f"). Oh, actually, your approach below just sorts by first appearance. Anyway, I guess it matters which you use if you're making an ordered factor. – Frank Oct 21 '19 at 00:07
  • @Frank, you are right! I'm trying to sort in alphanumeric order, but my code wasn't working, and when I finally made it work I rushed into answering my own question without noticing that my code was producing the wrong order. I didn't notice until recently. Do you know a "faster way" to do `sort(unique(x))` with multiple variables? - `sort(unique(V1))` works, but I want to learn the _proper_ way to do it – 504aldo Oct 21 '19 at 02:55
  • Hm, I think that sort(unique(x)) is generally the fastest way. However, I think you can just leave out the `levels =` argument and get the intended result. I'll post an answer illustrating – Frank Oct 21 '19 at 03:24
  • If it can help, there is `kit::funique` which is faster than `base::unique` and `kit::charToFact` which is faster than `base::as.factor` for converting character vector to factors. – Suresh_Patel Mar 07 '21 at 17:32

2 Answers


I think that sort(unique(x)) is probably the fastest idiomatic way to go, though the link in the OP shows benchmarks favoring another approach that might be worth looking into if speed is critical.

For the OP's case of making an ordered factor with alphanumerically ordered levels, though, we don't need to explicitly specify the levels. From ?factor:

The default [for the levels parameter] is the unique set of values taken by as.character(x), sorted into increasing order of x.
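A minimal sketch of what that default means in practice, using a made-up toy vector (not the OP's data): leaving out levels = should give the same ordered factor as passing sort(unique(x)) explicitly.

```r
# Toy vector, made up for illustration (not the OP's data)
x <- c("g", "e", "i", "i", "f", "f", "e")

# Explicit levels, as in the question
f_explicit <- factor(x, levels = sort(unique(x)), ordered = TRUE)

# Default levels, as described in ?factor
f_default <- factor(x, ordered = TRUE)

identical(f_explicit, f_default)  # TRUE
levels(f_default)                 # "e" "f" "g" "i"
```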

Also, we can reduce code repetition using lapply and .SD:

cols = c("V1", "V2")
DT[, (cols) := lapply(.SD, factor, ordered=TRUE), .SDcols=cols]
Frank

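I found the error =( so embarrassing... The left-hand side of := needs the column name in parentheses, (i), and factor() needs the column vector from get(i) rather than the list that mget(i) returns.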

cols <- c("V1", "V2")
for (i in cols) {
  levs <- DT[, .N, by = mget(i)][, mget(i)][[1]]
  DT[, (i) := factor(get(i), levels = levs, ordered = TRUE)]
}

str(DT)
Classes ‘data.table’ and 'data.frame':  10 obs. of  2 variables:
 $ V1: Ord.factor w/ 3 levels "A"<"C"<"B": 1 2 1 3 1 2 2 3 3 2
 $ V2: Ord.factor w/ 4 levels "g"<"e"<"i"<"f": 1 2 3 3 4 4 2 3 3 2
 - attr(*, ".internal.selfref")=<externalptr> 
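Note that the levels above come out in order of first appearance (as Frank's comment on the question points out), not in alphanumeric order. Below is a minimal sketch of the same loop with alphanumeric levels, assuming that is the order wanted. The toy data is made up for illustration, and data.table's set() is swapped in for := as one way to assign by reference inside a loop.

```r
library(data.table)

# Toy data, made up for illustration (not the real 10+ million row table)
DT <- data.table(
  V1 = c("A", "C", "A", "B"),
  V2 = c("g", "e", "i", "f")
)

cols <- c("V1", "V2")
for (i in cols) {
  # Alphanumeric levels rather than order of first appearance
  levs <- sort(unique(DT[[i]]))
  # set() overwrites column i by reference, like := does
  set(DT, j = i, value = factor(DT[[i]], levels = levs, ordered = TRUE))
}

levels(DT$V1)  # "A" "B" "C"
levels(DT$V2)  # "e" "f" "g" "i"
```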
504aldo