Selecting a subset of columns in a data.table

Question

I'd like to print all the columns of a data table dt except one of them named V3 but don't want to refer to it by number but by name. This is the code that I have:

  dt = data.table(matrix(sample(c(0,1),5,rep=T),50,10))
  dt[,-3,with=FALSE]   #  Is this the only way to not print column "V3"?

Using the data frame way, one could do this through the code:

  df = data.frame(matrix(sample(c(0,1),5,rep=T),50,10))
  df[,!(colnames(df)%in% c("X3"))]

So, my question is: is there another way to not print one column in a data table without the necessity of refer to it by number? I'd like to find something similar to the data frame syntax I used above but using data table.

Possible duplicate of [Select subset of columns in data.table R](https://stackoverflow.com/questions/28094645/select-subset-of-columns-in-data-table-r) — hhh, Jul 01 '17 at 00:08
@hhh that's not a correct duplicate; it should be closed the other way around — Jaap, Jul 01 '17 at 07:13
@Jaap I know but the quality of the newer question thread is better hence I voted other way around. — hhh, Jul 01 '17 at 17:16

score 51 · Accepted Answer · edited Jul 01 '17 at 09:29

Use a very similar syntax as for a data.frame, but add the argument with=FALSE:

dt[, setdiff(colnames(dt),"V9"), with=FALSE]
    V1 V2 V3 V4 V5 V6 V7 V8 V10
 1:  1  1  1  1  1  1  1  1   1
 2:  0  0  0  0  0  0  0  0   0
 3:  1  1  1  1  1  1  1  1   1
 4:  0  0  0  0  0  0  0  0   0
 5:  0  0  0  0  0  0  0  0   0
 6:  1  1  1  1  1  1  1  1   1

The use of with=FALSE is nicely explained in the documentation for the j argument in ?data.table:

j: A single column name, single expresson of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or (when with=FALSE) same as j in [.data.frame.

From v1.10.2 onwards it is also possible to do this as follows:

keep <- setdiff(names(dt), "V9")
dt[, ..keep]

Prefixing a symbol with .. will look up in calling scope (i.e. the Global Environment) and its value taken to be column names or numbers (source).

score 14 · Answer 2 · edited Sep 27 '19 at 07:33

14

Edit 2019-09-27 with a more modern approach

You can do this with patterns as mentioned above; or, you can do it with ! if there's a vector of names already:

dt[ , !'V3']
# or
drop_cols = 'V3'
dt[ , !..drop_cols]

.. means "look up one level"

Older version using with=FALSE (data.table is moving away from this argument steadily)

Here's a way that uses grep to convert to numeric and allow negative column indexing:

dt[, -grep("^V3$", names(dt)), with=FALSE]

You did say "V3" was to be excluded, right?

edited Sep 27 '19 at 07:33

MichaelChirico

33,841
14
113
198

answered Aug 13 '12 at 19:30

IRTFM

258,963
21
364
487

What about whit this approach if you want to exclude more than one column? @DWin – nhern121 Aug 13 '12 at 20:35
3

It's very easy to add conditions with regex-grep. Just use the OR symbol "|". `dt[ , -grep("^V3$|^V9$", names(dt), with=FALSE` – IRTFM Aug 14 '12 at 04:14
@IRTFM how does it work if you only want to use one element in a vector such as drop_cols(i) in a loop for example? – Herman Toothrot Jun 10 '20 at 14:16
@HermanToothrot I'm not certain I understand the question. Perhaps you meant to use a multi-element character vector named `drop_cols` created outside the loop and index with the for-loop iterator? I thought that it might involve using `drop_cols[i]` after the double-dot within the loop, but I was wrong. Deserves another question. – IRTFM Jun 10 '20 at 19:30

YvanR · Answer 3 · 2019-10-28T22:53:54.690

14

Maybe it's only in recent versions of data.table (I'm using 1.9.6), but you can do:

dt[, -'V3']

For several columns:

dt[, -c('V3', 'V9')]

Note that the quotes around the variable names are necessary. Also, if your column names are stored in a variable, say cols, you'll need to do dt[, -cols, with=FALSE].

edited Oct 28 '19 at 22:53

answered Feb 11 '16 at 22:25

YvanR

365
2
6

score 7 · Answer 4 · answered Jan 17 '19 at 07:23

7

From version 1.12.0 onwards, it is also possible to select columns using regular expressions on their names:

iris_DT <- as.data.table(iris)

iris_DT[, .SD, .SDcols = patterns(".e.al")]

answered Jan 17 '19 at 07:23

der_grund

1,898
20
36

IVIM · Answer 5 · 2021-12-07T13:12:21.307

To summarize the answer to this question, and also to make it
a) negation-friendly (so that you can also select columns by negation),
b) pipe-line friendly (so that you can use in a pipeline with %>% operator), and
c) so that you can select using both column numbers and column names, these are available options:

library(data.table);

select1 <- function (dt, range)  dt[, range, with=F]
select2 <- function (dt, range)  dt[, ..range]
select3 <- function (dt, range)  dt[, .SD, .SDcols=range] 

dt <- ggplot2::diamonds

range <- 1:3 # or 
range <- dt %>% names %>% .[1:3]

dt %>% select1(range);
dt %>% select2(range); 
dt %>% select3(range); 

dt %>% select1(-range);
dt %>% select2(-range); 
dt %>% select3(-range); # DOES NOT WORK

Also we note that this
dt %>% .[, ..(names(dt)[1:3])] # DOES NOT WORK

Therefore the best (most universal and fast) way to select multiple columns in data.table is the following:

# columns are selected using column numbers:
range <- 1:3
dt %>% select1(range); 
dt %>% .[, range, with=F]

# The same works if columns are selected using column names:
range <- names(dt) [1:3]
dt %>% select1(range); 
dt %>% .[, range, with=F]

PS. If, instead of selecting multiple columns, you want to efficiently delete multiple columns from data.table by reference (i.e. instead of copying the entire data.table), then you can use data.table's := operator. But I don't know how to do it for multiple columns in one line

Selecting a subset of columns in a data.table

5 Answers5

Linked

Related