
I want to convert a subset of data.table cols to a new class. There's a popular question here (Convert column classes in data.table) but the answer creates a new object, rather than operating on the starter object.

Take this example:

library(data.table)
dat <- data.table(ID=c(rep("A", 5), rep("B",5)), Quarter=c(1:5, 1:5), value=rnorm(10))
cols <- c('ID', 'Quarter')

How best to convert just the cols columns to (e.g.) a factor? In a normal data.frame you could do this:

dat[, cols] <- lapply(dat[, cols], factor)

but that doesn't work for a data.table, and neither does this:

dat[, .SD := lapply(.SD, factor), .SDcols = cols]

A comment in the linked question from Matt Dowle (from Dec 2013) suggests the following, which works fine, but seems a bit less elegant.

for (j in cols) set(dat, j = j, value = factor(dat[[j]]))

Is there currently a better data.table answer (i.e. shorter + doesn't generate a counter variable), or should I just use the above + rm(j)?

arvi1000
  • I believe the method Matt Dowle recommends would be best. He is, after all, the *data.table* author. – Rich Scriven Oct 05 '15 at 02:47
  • True, but the comment was from 2013 and there have been many package updates since then, so I thought it worth throwing this fishing line out – arvi1000 Oct 05 '15 at 02:55
  • More detail on the `for(...) set(...)` idiom added recently here: http://stackoverflow.com/a/33000778/403310 – Matt Dowle Nov 23 '15 at 10:59

2 Answers


Besides the option suggested by Matt Dowle, another way to change the column classes is as follows:

dat[, (cols) := lapply(.SD, factor), .SDcols = cols]

By using the := operator you update the data.table by reference. A check that this worked:

> sapply(dat,class)
       ID   Quarter     value 
 "factor"  "factor" "numeric" 

As suggested by @MattDowle in the comments, you can also use a combination of for(...) set(...) as follows:

for (col in cols) set(dat, j = col, value = factor(dat[[col]]))

which will give the same result. A third alternative is:

for (col in cols) dat[, (col) := factor(dat[[col]])]

On smaller datasets, the for(...) set(...) option is about three times faster than the lapply option (though that hardly matters on a small dataset). On larger datasets (e.g. 2 million rows), each of these approaches takes about the same amount of time. For testing on a larger dataset, I used:

dat <- data.table(ID=c(rep("A", 1e6), rep("B",1e6)),
                  Quarter=c(1:1e6, 1:1e6),
                  value=rnorm(10))
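
To put rough numbers on that, a minimal timing sketch, reusing dat and cols from above (copy() gives each approach its own unconverted data to work on, and system.time() is base R):

d1 <- copy(dat)   # deep copies, so both runs convert the original character/integer columns
d2 <- copy(dat)

system.time(d1[, (cols) := lapply(.SD, factor), .SDcols = cols])
system.time(for (col in cols) set(d2, j = col, value = factor(d2[[col]])))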

Sometimes, you will have to do it a bit differently (for example when numeric values are stored as a factor). Then you have to use something like this:

dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
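
The as.character() step is what makes this work: calling as.integer() directly on a factor returns the internal level codes rather than the original numbers. A small illustration:

f <- factor(c(10, 20, 30))

as.integer(f)                  # 1 2 3    (the level codes)
as.integer(as.character(f))    # 10 20 30 (the original values)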


WARNING: The following is not the data.table way of doing things. The data.table is not updated by reference, because a copy is made and stored in memory (as pointed out by @Frank), which increases memory usage. It is included mainly to explain how with = FALSE works.

When you want to change the column classes the same way as you would with a data.frame, you have to add with = FALSE as follows:

dat[, cols] <- lapply(dat[, cols, with = FALSE], factor)

A check that this worked:

> sapply(dat,class)
       ID   Quarter     value 
 "factor"  "factor" "numeric" 

If you don't add with = FALSE, data.table evaluates cols as an expression in j, so dat[, cols] simply returns the character vector itself. Check the difference in output between dat[, cols] and dat[, cols, with = FALSE]:

> dat[, cols]
[1] "ID"      "Quarter"

> dat[, cols, with = FALSE]
    ID Quarter
 1:  A       1
 2:  A       2
 3:  A       3
 4:  A       4
 5:  A       5
 6:  B       1
 7:  B       2
 8:  B       3
 9:  B       4
10:  B       5
Jaap
  • I think you don't want to use `<-`. If I run `address()` before and after it, it looks like the modification is not by reference (even though `[<-.data.table` does seem to cover this usage). – Frank Oct 05 '15 at 12:50
  • @Frank I know, but I added the second part primarily to explain why you need `with=FALSE` when you want a datatable to return the columns the same way as a dataframe. I added a warning. – Jaap Oct 05 '15 at 13:18
  • Yeah, I know how `with` works, but that's not the `data.table` native way, as you point out. Thanks for the main answer – arvi1000 Oct 05 '15 at 15:01
  • Just using `lapply` on the RHS of `:=` can be the memory inefficiency, not really `:=` itself. More details [here](http://stackoverflow.com/a/33000778/403310) on why `for(...)set(...)` uses least memory. In this case though, the call to `factor()` on each column may dominate. – Matt Dowle Nov 23 '15 at 11:06
  • @MattDowle Thanx. Added to my answer (with a remark about performance as well). – Jaap Nov 23 '15 at 14:56
  • @Frank Added a `set` alternative. – Jaap Nov 23 '15 at 14:58
  • @Jaap New edit is good other than it's not just about speed but peak memory usage as well. For larger datasets it can make a difference between working or not working (out-of-memory error with the `:=lapply` way). Thanks to these questions and comments, I filed an issue to optimize this so the user doesn't need to know these differences, [#1441](https://github.com/Rdatatable/data.table/issues/1441). – Matt Dowle Nov 24 '15 at 00:38

You can use .SDcols:

dat[, cols] <- dat[, lapply(.SD, factor), .SDcols=cols]

Chris
  • Any reason this is getting downvotes? It seems to get the result the asker is looking for – Chris Oct 05 '15 at 12:48
  • Same as my comment on the other answer. One big advantage of using a data.table is modification by reference, but as far as I can tell, `[<-` with one fails to take advantage of that. – Frank Oct 05 '15 at 12:52
  • @Chris people here take the "no short answers" guideline a little too seriously. Although there's almost always room to expand/expound. – shadowtalker Oct 05 '15 at 15:29