I want to perform a data.table task over and over in a function call: reducing the number of levels of large categorical variables. My problem is similar to "Data.table and get() command (R)" and "pass column name in data.table using variable in R", but I can't get it to work.

Without a function call this works just fine:

# Load data.table
require(data.table)

# Some data
set.seed(1)
dt <- data.table(type = factor(sample(c("A", "B", "C"), 10e3, replace = TRUE)),
                 weight = rnorm(n = 10e3, mean = 70, sd = 20))

# Decide the minimum frequency a level needs...
min.freq <- 3350

# Levels that don't meet the minimum frequency (using data.table)
fail.min.f <- dt[, .N, type][N < min.freq, type]

# Call all these levels "Other"
levels(dt$type)[fail.min.f] <- "Other"

But wrapped in a function like this:

reduceCategorical <- function(variableName, min.freq){
  fail.min.f <- dt[, .N, variableName][N < min.freq, variableName]
  levels(dt[, variableName][fail.min.f]) <- "Other"
}

I only get errors like:

 reduceCategorical(dt$x, 3350)
Error in levels(df[, variableName][fail.min.f]) <- "Other" : 
 trying to set attribute of NULL value

And sometimes

Error is: number of levels differs
Georg Heiler
  • It is always better to use a `data.table` syntax when working with `data.table`.... – David Arenburg Aug 22 '16 at 08:00
  • What are you referring to? `df[, variableName][fail.min.f]` is correct data.table isn't it? – Georg Heiler Aug 22 '16 at 08:01
  • No, it's not correct way to work with factors. You could do this in two steps, but I haven't tested for efficiency: `dt[type %in% fail.min.f, type := "Other"]; dt[, type := factor(type)]` I'll try to think of a better way though – David Arenburg Aug 22 '16 at 08:10
  • Strange - your method is setting all values in the type column to Other. – Georg Heiler Aug 22 '16 at 08:20
  • No it's not.... Btw, your function doesn't have a `df` variable and you should be aware that `:=` modifies data.tables within functions too – David Arenburg Aug 22 '16 at 08:22
  • thanks - edited it. – Georg Heiler Aug 22 '16 at 08:28
  • In single use case, your solution works just fine. But within the function everything is marked as Other. – Georg Heiler Aug 22 '16 at 08:31
  • `f <- function(var, min.freq) {fail.min.f <- dt[, .N, by = var][N < min.freq, get(var)];dt[get(var) %in% fail.min.f, (var) := "Other"];dt[, (var) := factor(get(var))]};f("type", min.freq)` – David Arenburg Aug 22 '16 at 08:44
  • Another version of the same function `f <- function(var, min.freq) {fail.min.f <- dt[, .I[.N < min.freq], by = var]$V1;set(dt, fail.min.f, var, "other");set(dt, NULL, var, factor(dt[[var]]))}` – David Arenburg Aug 22 '16 at 08:48
  • Thanks a lot - that works. Which one of the two would you recommend? – Georg Heiler Aug 22 '16 at 08:48
  • The second one. But there must be a better way – David Arenburg Aug 22 '16 at 08:50
  • Try also `f <- function(variableName, min.freq){fail.min.f <- dt[, .N, by = variableName][N < min.freq, get(variableName)];invisible(dt[, setattr(get(variableName), "levels", c(setdiff(levels(get(variableName)), fail.min.f), rep("other", length(fail.min.f))))])}` - it seems like the best option to me – David Arenburg Aug 22 '16 at 10:13
  • Why does the last variant only work if everything is in a single line? – Georg Heiler Aug 22 '16 at 10:18
  • Not sure what you mean. You could break it into lines where `;` is and remove it – David Arenburg Aug 22 '16 at 10:21
  • Indeed that's what I tried but then everything was labelled as "other" – Georg Heiler Aug 22 '16 at 10:21
  • It modifies everything in place. Once you run it once, your data set will be already modified *in place*. The second time you'll run it, it will be already on the previously modified data set. If you don't want to modify in place just use the base R syntax: `f <- function(df, variableName, min.freq){fail.min.f <- df[, .N, by = variableName][N < min.freq, get(variableName)];levels(df$type)[fail.min.f] <- "Other";df} ;f(dt, "type", min.freq)`. This will leave your original `dt` unmodified and return a whole new data set – David Arenburg Aug 22 '16 at 10:24
  • ok - so there must have been an issue on my side – Georg Heiler Aug 22 '16 at 10:26
  • @DAvid do you want to write your last function as "the solution" ? – Georg Heiler Aug 22 '16 at 10:31
  • Why, you like the base R solution best? So why do you need the `data.table` package here in the first place :) – David Arenburg Aug 22 '16 at 10:32
  • sorry - I did mean the one from 19 mins ago with data.table ;) But choose the one you see fit. I will go with the data.table version. – Georg Heiler Aug 22 '16 at 10:34
  • Btw, do you insist on `factors`? You could do the whole process in a single simple line if `type` were a `character`. Something like `dt[, type := as.character(type)] ; dt[, type := if(.N < min.freq) "other", by = type]` – David Arenburg Aug 22 '16 at 10:58

2 Answers

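One possibility is to define your own re-leveling function using data.table::setattr that will modify dt in place. The new labels have to stay in the positions of the old ones, because a factor's integer codes index into the levels attribute by position. If more than one level fails, "other" then appears several times in that attribute, so rebuild the column with factor(as.character(...)) afterwards if you need unique levels. Something like

DTsetlvls <- function(x, newl) {
  lv <- levels(x)
  setattr(x, "levels", replace(lv, lv %in% newl, "other"))  # relabel by position so codes still match
}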

Then use it within another predefined function

f <- function(variableName, min.freq){
  fail.min.f <- dt[, .N, by = variableName][N < min.freq, get(variableName)]
  dt[, DTsetlvls(get(variableName), fail.min.f)]
  invisible()
}

f("type", min.freq)
levels(dt$type)
# [1] "C"     "other"
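
When relabelling through setattr, it can be worth checking that every row still maps to the label you expect, since setattr writes the levels attribute directly, without the validation that levels<- performs. A minimal sketch on hypothetical toy data (toy, rare, before and after are illustrative names, not from the answer), assuming only one level falls below the threshold:

```r
library(data.table)

# Hypothetical toy data: "B" is the only rare level
toy <- data.table(type = factor(c("A", "A", "A", "B", "C", "C", "C")))
before <- as.character(toy$type)
rare <- "B"

# Relabel the rare level in its original position, by reference
lv <- levels(toy$type)
setattr(toy$type, "levels", replace(lv, lv %in% rare, "other"))

# Every row that was rare should now read "other"; all others are unchanged
after <- as.character(toy$type)
print(identical(after, ifelse(before %in% rare, "other", before)))
```

If the comparison comes back FALSE, the labels were reordered relative to the integer codes.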

Some other data.table alternatives

f <- function(var, min.freq) {
  fail.min.f <- dt[, .N, by = var][N < min.freq, get(var)]
  dt[get(var) %in% fail.min.f, (var) := "Other"]
  dt[, (var) := factor(get(var))]
}

Or using set/.I

f <- function(var, min.freq) {
  fail.min.f <- dt[, .I[.N < min.freq], by = var]$V1
  set(dt, fail.min.f, var, "other")
  set(dt, NULL, var, factor(dt[[var]]))
}

Or combining with base R (doesn't modify original data set)

f <- function(df, variableName, min.freq){
  fail.min.f <- df[, .N, by = variableName][N < min.freq, get(variableName)]
  levels(df[[variableName]])[fail.min.f] <- "Other"
  df
} 
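
To see the "doesn't modify the original" behaviour concretely, here is a small self-contained sketch of the same idea. The helper name lump_rare_levels and the toy data are hypothetical, and the explicit copy() is an assumption added to make the no-side-effect guarantee obvious rather than something the answer requires:

```r
library(data.table)

# Hypothetical helper: returns a relabelled copy and leaves its input alone
lump_rare_levels <- function(df, variableName, min.freq) {
  out <- copy(df)                                   # work on a deep copy
  counts <- out[, .N, by = variableName]            # frequency per level
  rare <- as.character(counts[N < min.freq][[variableName]])
  x <- out[[variableName]]
  levels(x)[levels(x) %in% rare] <- "Other"         # levels<- merges duplicate labels
  set(out, j = variableName, value = x)
  out
}

toy <- data.table(g = factor(c("a", "a", "a", "b", "c")))
res <- lump_rare_levels(toy, "g", 2)
print(levels(res$g))   # relabelled copy
print(levels(toy$g))   # original is untouched
```

Because base R's levels<- merges repeated labels, the returned factor has a single "Other" level however many rare levels there were.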

Alternatively, you could stick with characters instead; if type is a character column, you could simply do

f <- function(var, min.freq) dt[, (var) := if(.N < min.freq) "other", by = var]
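
For readers who prefer the steps spelled out, a more explicit variant of the character-column idea is sketched below. The function name lump_rare_chr, the temporary column n_tmp and the toy data are all hypothetical, and the table is passed in as an argument here rather than taken from the global environment:

```r
library(data.table)

# Hypothetical helper: relabels rare values of a character column by reference
lump_rare_chr <- function(DT, var, min.freq) {
  DT[, n_tmp := .N, by = var]              # group size for every row
  DT[n_tmp < min.freq, (var) := "other"]   # overwrite only the rare groups
  DT[, n_tmp := NULL]                      # drop the helper column again
  invisible(DT)
}

toy <- data.table(type = c("A", "A", "A", "B", "C"))
lump_rare_chr(toy, "type", 2)
print(toy$type)
```

This trades the one-liner for three small by-reference steps, which can be easier to debug when a column is unexpectedly changed.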
David Arenburg

You are referencing things a little differently in the wrapper. The standalone code refers to the column by its name (type), but the call passes the whole column vector (dt$x) as variableName. The same applies when getting the levels: variableName is not used the way type is used in the standalone code.

The error occurs because fail.min.f ends up NULL as a result of that referencing.
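
As a rough illustration of the difference, the sketch below passes the column name as a string and looks the column up with [[ ]]. The toy data and the names toy, variableName, counts and rare are hypothetical, chosen only for this example:

```r
library(data.table)

# Hypothetical toy data, not from the question
toy <- data.table(x = factor(c("a", "a", "a", "b", "c")))

# Pass the column *name*; by = accepts a character vector of column names
variableName <- "x"
counts <- toy[, .N, by = variableName]

# Look the column up by name to get the levels that are too rare
rare <- counts[N < 2][[variableName]]
print(as.character(rare))
```

Passing toy$x instead would hand the function the values themselves, so there would be no column name left to group by or to assign back to.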

wololo
  • I used dt$x e.g. the whole name in order to prevent the error of using `reduceCategorical('x', 3350)` which results in the following error: In `[<-.data.table`(`*tmp*`, , variableName, value = c("x", NA)) : Supplied 2 items to be assigned to 3 items of column 'x' (recycled leaving remainder of 1 items). – Georg Heiler Aug 22 '16 at 07:58
  • In case `fail.min.f` is the last statement - it is not null and prints the desired result. – Georg Heiler Aug 22 '16 at 07:59