How to subset data.table on the basis of a values in a certain column number

Question

In data.table, one way to subset a table on the basis of a numerical vector of column numbers involves using with=FALSE.

I'm trying to loop through a data.table on the basis of a numerical vector of column numbers, looking for rows meeting a certain criterion, as follows:

require(data.table)

ab=data.table(id=c("geneA", "geneB", "geneC", "geneA", "geneA", "geneB", "", "NA"),
              co1=c(1,2,3,0,7), co2=c(0,0,4,5,6), nontarget=c(9,0,7,6,5), 
              co3=c(0,1,2,3,4))
target_col_nums=grep('co', colnames(ab))

##Data.table doesn't treat colnames(ab)[i] as one of the
##  column name variables, and with=F only seems to work for j in dt[i,j,by]
for (i in target_col_nums){
    print(ab[colnames(ab)[i]>3])
}

##This produces the desired output
ab[co1>3]
ab[co2>3]
ab[co3>3]

In my situation, my actual table is quite large, so I can't use the colnames themselves.

I hope that this is a useful question to the community.

"my actual table is quite large, so I can't use the colnames themselves." This does not make sense to me. If your column names are meaningless, go ahead and use a matrix instead, but otherwise, I think they belong in a string/character vector instead of being referred to by number. Have a look at 1.1 here for why this is usually advised https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-faq.html#j-num — Frank, Mar 21 '16 at 03:59

score 4 · Answer 1 · answered Mar 21 '16 at 18:13

4

for (col in grep('co', names(ab), value = T))
  print(ab[get(col) > 3])
#      id co1 co2 nontarget co3
#1: geneA   7   6         5   4
#      id co1 co2 nontarget co3
#1: geneC   3   4         7   2
#2: geneA   0   5         6   3
#3: geneA   7   6         5   4
#4:    NA   3   4         7   2
#      id co1 co2 nontarget co3
#1: geneA   7   6         5   4

answered Mar 21 '16 at 18:13

eddi

49,088
6
104
155

1

basically the snazzier version of mine! great exemplification of the `value` argument – MichaelChirico Mar 22 '16 at 04:51

score 3 · Accepted Answer · edited May 23 '17 at 10:28

3

You can evaluate (eval) the columns as an expression

for (i in target_col_nums){
    expr <- paste0(colnames(ab)[i], ">3")
    print(ab[eval(parse(text = expr)), ])
}

#      id co1 co2 nontarget co3
#1: geneA   7   6         5   4
#      id co1 co2 nontarget co3
#1: geneC   3   4         7   2
#2: geneA   0   5         6   3
#3: geneA   7   6         5   4
#4:    NA   3   4         7   2
#      id co1 co2 nontarget co3
#1: geneA   7   6         5   4

Or you can try any of the suggestions in the question passing variables as data.table column names

edited May 23 '17 at 10:28

Community

1
1

answered Mar 21 '16 at 02:44

SymbolixAU

25,502
4
67
139

1

Might as well do `expr <- parse(text=paste0("print(ab[", names(ab)[target_col_nums], ">3])"))`. Then can just do `sapply(expr, eval)`. ` – MichaelChirico Mar 21 '16 at 14:36

score 2 · Answer 3 · answered Mar 21 '16 at 14:31

Your approach can be adjusted very slightly and still get around using column numbers (which, though not so harmful in this case since you got the numbers programatically, is generally bad practice):

target_cols = names(ab)[grepl("co", names(ab))]

sapply(target_cols, function(jj) print(ab[get(jj) > 3]))

Wrap in invisible if the NULL input is a distraction/otherwise bothers you.

score 1 · Answer 4 · answered Mar 21 '16 at 03:00

We can specify the 'i' in .SDcols and use the condition on .SD to get a logical vector, which can be used for subsetting the rows.

for(i in target_col_nums){
 print(ab[ab[, .SD[[1L]] >3, .SDcols = i]])
}
#         id co1 co2 nontarget co3
#1: geneA   7   6         5   4
#      id co1 co2 nontarget co3
#1: geneC   3   4         7   2
#2: geneA   0   5         6   3
#3: geneA   7   6         5   4
#4:    NA   3   4         7   2
#      id co1 co2 nontarget co3
#1: geneA   7   6         5   4

How to subset data.table on the basis of a values in a certain column number

4 Answers4