1

In data.table, one way to subset a table on the basis of a numerical vector of column numbers involves using with=FALSE.

I'm trying to loop through a data.table on the basis of a numerical vector of column numbers, looking for rows meeting a certain criterion, as follows:

require(data.table)

ab=data.table(id=c("geneA", "geneB", "geneC", "geneA", "geneA", "geneB", "", "NA"),
              co1=c(1,2,3,0,7), co2=c(0,0,4,5,6), nontarget=c(9,0,7,6,5), 
              co3=c(0,1,2,3,4))
target_col_nums=grep('co', colnames(ab))

##Data.table doesn't treat colnames(ab)[i] as one of the
##  column name variables, and with=F only seems to work for j in dt[i,j,by]
for (i in target_col_nums){
    print(ab[colnames(ab)[i]>3])
}

##This produces the desired output
ab[co1>3]
ab[co2>3]
ab[co3>3]

In my situation, my actual table is quite large, so I can't use the colnames themselves.

I hope that this is a useful question to the community.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
Atticus29
  • 4,190
  • 18
  • 47
  • 84
  • 1
    Are you aware your example data generates warnings? – SymbolixAU Mar 21 '16 at 02:59
  • 1
    "my actual table is quite large, so I can't use the colnames themselves." This does not make sense to me. If your column names are meaningless, go ahead and use a matrix instead, but otherwise, I think they belong in a string/character vector instead of being referred to by number. Have a look at 1.1 here for why this is usually advised https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-faq.html#j-num – Frank Mar 21 '16 at 03:59

4 Answers4

4
for (col in grep('co', names(ab), value = T))
  print(ab[get(col) > 3])
#      id co1 co2 nontarget co3
#1: geneA   7   6         5   4
#      id co1 co2 nontarget co3
#1: geneC   3   4         7   2
#2: geneA   0   5         6   3
#3: geneA   7   6         5   4
#4:    NA   3   4         7   2
#      id co1 co2 nontarget co3
#1: geneA   7   6         5   4
eddi
  • 49,088
  • 6
  • 104
  • 155
3

You can evaluate (eval) the columns as an expression

for (i in target_col_nums){
    expr <- paste0(colnames(ab)[i], ">3")
    print(ab[eval(parse(text = expr)), ])
}

#      id co1 co2 nontarget co3
#1: geneA   7   6         5   4
#      id co1 co2 nontarget co3
#1: geneC   3   4         7   2
#2: geneA   0   5         6   3
#3: geneA   7   6         5   4
#4:    NA   3   4         7   2
#      id co1 co2 nontarget co3
#1: geneA   7   6         5   4

Or you can try any of the suggestions in the question passing variables as data.table column names

Community
  • 1
  • 1
SymbolixAU
  • 25,502
  • 4
  • 67
  • 139
  • 1
    Might as well do `expr <- parse(text=paste0("print(ab[", names(ab)[target_col_nums], ">3])"))`. Then can just do `sapply(expr, eval)`. ` – MichaelChirico Mar 21 '16 at 14:36
2

Your approach can be adjusted very slightly and still get around using column numbers (which, though not so harmful in this case since you got the numbers programatically, is generally bad practice):

target_cols = names(ab)[grepl("co", names(ab))]

sapply(target_cols, function(jj) print(ab[get(jj) > 3]))

Wrap in invisible if the NULL input is a distraction/otherwise bothers you.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
1

We can specify the 'i' in .SDcols and use the condition on .SD to get a logical vector, which can be used for subsetting the rows.

for(i in target_col_nums){
 print(ab[ab[, .SD[[1L]] >3, .SDcols = i]])
}
#         id co1 co2 nontarget co3
#1: geneA   7   6         5   4
#      id co1 co2 nontarget co3
#1: geneC   3   4         7   2
#2: geneA   0   5         6   3
#3: geneA   7   6         5   4
#4:    NA   3   4         7   2
#      id co1 co2 nontarget co3
#1: geneA   7   6         5   4
akrun
  • 874,273
  • 37
  • 540
  • 662