
Is it possible to subset data frame columns (into a new df) using column names stored in a character vector, such as c("col1", "col9", "col6")? I know I can reference a single column with the df[[colname]] syntax, but that does not work for multiple columns:

df
   X1 X2 X3
1:  a  1  3
2:  b  5  3
3:  a  3  4
4:  c  6  5
5:  c  2  2

cnm<-c("X2","X3")

df[[cnm]]

Error in .subset2(x, i, exact = exact) : subscript out of bounds

thanks

aosmith
Zoran Krunic
  • Thanks. The first one works but requires converting the data frame into a table. The second one did not work when I tried it: `cnm <- c("X2", "X3")` followed by `df[cnm]` gives: Error in `[.data.table`(df, cnm) : When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted) so data.table knows which columns to join to and take advantage of x being sorted. Call setkey(x,...) first, see ?setkey. – Zoran Krunic Sep 14 '16 at 18:59
  • The second one will not work because your dataset is a `data.table` – akrun Sep 14 '16 at 19:00

1 Answer


Based on the printed output, the OP's dataset appears to be a data.table. To select columns of a data.table by a character vector of names, we need with = FALSE:

df[, cnm, with = FALSE]
#   X2 X3
#1:  1  3
#2:  5  3
#3:  3  4
#4:  6  5
#5:  2  2

According to the ?data.table documentation

with - By default with=TRUE and j is evaluated within the frame of x; column names can be used as variables.

When with=FALSE j is a character vector of column names, a numeric vector of column positions to select or of the form startcol:endcol, and the value returned is always a data.table. with=FALSE is often useful in data.table to select columns dynamically. Note that x[, cols, with=FALSE] is equivalent to x[, .SD, .SDcols=cols].
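To make the equivalence quoted above concrete, here is a minimal sketch that rebuilds the example table and selects the same columns in three ways. The third form, the `..` prefix, is not mentioned in the quoted documentation; it assumes a reasonably recent data.table version (1.10.2 or later).

```r
library(data.table)

# Rebuild the example table from the question
df <- data.table(X1 = c("a", "b", "a", "c", "c"),
                 X2 = c(1, 5, 3, 6, 2),
                 X3 = c(3, 3, 4, 5, 2))
cnm <- c("X2", "X3")

res_with <- df[, cnm, with = FALSE]    # j is taken as a vector of column names
res_sd   <- df[, .SD, .SDcols = cnm]   # same selection through .SD
res_dots <- df[, ..cnm]                # '..' looks up cnm in the calling scope

# All three results should hold the same two columns
names(res_with)
names(res_sd)
names(res_dots)
```

All three calls return a data.table, so the result can be chained with further `[` calls.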

If the dataset is a data.frame, just

setDF(df)  # convert to 'data.frame' (by reference)
df[cnm]
#   X2 X3
#1  1  3
#2  5  3
#3  3  4
#4  6  5
#5  2  2

will subset the dataset

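The [[ operator extracts a single column of a data.frame or a single list element. Given a vector of length greater than one, it indexes recursively, which is why df[[cnm]] fails.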
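The difference between `[[` and `[` can be seen in base R alone. The sketch below uses a plain data.frame with the question's values; the nested list `lst` is a made-up example added only to show what a multi-element index means to `[[`.

```r
# Plain data.frame version of the example data
dfr <- data.frame(X1 = c("a", "b", "a", "c", "c"),
                  X2 = c(1, 5, 3, 6, 2),
                  X3 = c(3, 3, 4, 5, 2))
cnm <- c("X2", "X3")

one_col  <- dfr[["X2"]]   # a single column, returned as a plain vector
two_cols <- dfr[cnm]      # several columns, returned as a data.frame

# With a vector of length > 1, [[ indexes recursively:
# lst[[c("a", "b")]] means lst[["a"]][["b"]]
lst <- list(a = list(b = 42))
nested <- lst[[c("a", "b")]]

# dfr[[cnm]] would therefore look for element "X3" inside column X2,
# which does not exist, hence "subscript out of bounds"
failed <- inherits(try(dfr[[cnm]], silent = TRUE), "try-error")
```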


Applying the OP's code to a data.table gives the same error message:

df[[cnm]]

Error in .subset2(x, i, exact = exact) : subscript out of bounds

If we try the data.frame subsetting syntax on a data.table, it will not work either, because a character vector in i is treated as a join:

df[cnm]

Error in `[.data.table`(df, cnm) : When i is a data.table (or character vector), the columns to join by must be specified either using 'on=' argument (see ?data.table) or by keying x (i.e. sorted, and, marked as sorted, see ?setkey). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
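When code has to handle both classes, one option is to branch on the class before subsetting. The helper below, `select_cols`, is a hypothetical name used only for illustration and is not part of data.table; treat it as a sketch rather than a recommended API.

```r
library(data.table)

# Hypothetical helper (not part of data.table): select columns by name
# from either a data.table or a plain data.frame.
select_cols <- function(x, cols) {
  if (is.data.table(x)) {
    x[, cols, with = FALSE]    # data.table: j is read as column names
  } else {
    x[, cols, drop = FALSE]    # data.frame: stay a data.frame even for one column
  }
}

dt <- data.table(X1 = c("a", "b", "a", "c", "c"),
                 X2 = c(1, 5, 3, 6, 2),
                 X3 = c(3, 3, 4, 5, 2))
cnm <- c("X2", "X3")

from_dt <- select_cols(dt, cnm)                  # still a data.table
from_df <- select_cols(as.data.frame(dt), cnm)   # plain data.frame
```

Unlike setDF(), which converts the object in place, as.data.frame() returns a converted copy and leaves `dt` as a data.table.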

akrun