6

I need to drop one column from a data.frame containing a few hundred columns.

With a data.frame, I'd use subset to do this conveniently:

> dat <- data.table( data.frame(x=runif(10),y=rep(letters[1:5],2),z=runif(10)),key='y' )
> subset(dat,select=c(-z))
            x y
 1: 0.1969049 a
 2: 0.7916696 a
 3: 0.9095970 b
 4: 0.3529506 b
 5: 0.4923602 c
 6: 0.5993034 c
 7: 0.1559861 d
 8: 0.9929333 d
 9: 0.3980169 e
10: 0.1921226 e

Obviously this still works, but it seems like not a very data.table-like idiom. I could manually construct a list of the column names I wanted to keep, which seems a little more data.table-like:

> dat[,list(x,y)]
            x y
 1: 0.1969049 a
 2: 0.7916696 a
 3: 0.9095970 b
 4: 0.3529506 b
 5: 0.4923602 c
 6: 0.5993034 c
 7: 0.1559861 d
 8: 0.9929333 d
 9: 0.3980169 e
10: 0.1921226 e

But then I have to construct such a list, which is clunky.

Is subset the proper way to conveniently drop a column or two, or does it cause a performance hit? If not, what's the better way?

Edit

Benchmarks:

> dat <- data.table( data.frame(x=runif(10^7),y=rep(letters[1:10],10^6),z=runif(10^7)),key='y' )
> microbenchmark( subset(dat,select=c(-z)), dat[,list(x,y)] )
Unit: milliseconds
                         expr       min        lq    median        uq      max
1           dat[, list(x, y)] 102.62826 167.86793 170.72847 199.89789 792.0207
2 subset(dat, select = c(-z))  33.26356  52.55311  53.53934  55.00347 180.8740

But really where it may matter more is for memory if subset copies the whole data.table.

Ari B. Friedman
  • 71,271
  • 35
  • 175
  • 235
  • One question: what's your intention to drop this column? I mean what are you going to do with the resulting data.table? – Arun May 10 '13 at 01:00
  • @Arun The particular case that made me ask, I'd mis-named a column when creating with `:=` and wanted to drop it. But it comes up all the time that dumping a column or two is handy. – Ari B. Friedman May 10 '13 at 01:10
  • 1
    As @mnel explains, if you want to just remove, you can directly use the `:=` operator and if you want to perform computations without this column, then you can always use `.SDcols`. So, you can avoid `subset` / `with=FALSE` solution altogether. – Arun May 10 '13 at 01:16

2 Answers2

9

If you are wanting to remove the column permanently use := NULL

dat[, z := NULL]

If you have your columns to drop as a character string use () to force evaluation as a character string, not as the character name.

toDrop <- c('z')

dat[, (toDrop) := NULL]

If you want to limit the availability of the columns in .SD, you can pass the .SDcols argument

dat[,lapply(.SD, somefunction) , .SDcols = setdiff(names(dat),'z')]

However, data.table inspects the j arguments and only gets the columns you use any way. See FAQ 1.12

When you write X[Y,sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses.

and doesn't try and load all the data for .SD (unless you have .SD within your call to j)


subset.data.table is processing the call and eventually evaluating dat[, c('x','y'), with=FALSE]

using := NULL should be basically instantaneous, howveer t does permanently delete the column.

mnel
  • 113,303
  • 27
  • 265
  • 254
  • 2
    mnel said: "Use`()`to force evaluation as a character string".... I've been trying to figure out how to do that for weeks. That little tip is worthy of its own Q&A. – D. Woods Aug 12 '13 at 18:41
1

I think this is what you're looking for.

dat[, !"z"]

Here's a benchmark on the huge data from your edit.

Unit: milliseconds
                         expr       min        lq    median       uq      max neval
  subset(dat, select = c(-z))  53.37435  56.82514  61.81279 100.3458 339.1400   100
            dat[, list(x, y)] 191.46678 354.39905 412.06421 451.3933 678.3981   100
                  dat[, !"z"]  53.49184  57.31756  62.15506 112.7063 398.0107   100
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
Arun
  • 116,683
  • 26
  • 284
  • 387
  • `subset.data.table` is identical to your call because they are doing the same thing see `getAnywhere('subset.data.table')` – mnel May 10 '13 at 00:46
  • I prefer this syntax to `subset` because of [**this post**](http://stackoverflow.com/questions/9860090/in-r-why-is-better-than-subset). I assumed the same argument could be extended to `data.table` – Arun May 10 '13 at 00:59
  • @Arun Care to explain how this works? Why's the `with` option need to be `FALSE`, for instance? – Ari B. Friedman May 10 '13 at 01:29
  • When `with=TRUE` (default), `j` is evaluated within the scope of your data.table. That is, it "sees" the columns of your data.table as variables. To overcome this default and get the same behaviour as in a data.frame, we set `with=FALSE`. If not, `DT[, "x"]` would just give you back `"x"` as `"x"` evaluates to `"x"`. [**Faq 1.1 and 1.2**](http://datatable.r-forge.r-project.org/datatable-faq.pdf) may be worth a read for more information. – Arun May 10 '13 at 01:37