4

At the moment, when trying to select columns not directly by their name, but rather with grepl, one has to add an additional call to the data.table in the grepl command using colnames(). Is it possible to directly implement this functionality into data.table, so that one could use grepl directly and the column names are automatically taken from the current data.table.

dt <- structure(list(Global.Company.Key = c(1380L, 1380L, 1380L, 1380L, 1380L)
                         , Calendar.Data.Year.and.Quarter = structure(c(2000, 2000, 2000, 2000, 2000), class = "yearqtr")
                         , Current.Assets.Total = c(2218, 2218, 2218, 2218, 2218)
                         , DRILL_TYPE = c("U", "D", "V", "H", "U")
                         , DI.Oil.Prod.Quarter = c(18395.6792379842, 1301949.24041659, 235.311086392291, 27261.8049684835, 4719.27956989249)
                         , DI.Gas.Prod.Quarter = c(1600471.27107983, 4882347.22928982, 2611.60215053765, 9634.76418242493, 27648.276603634)), .Names = c("Global.Company.Key", "Calendar.Data.Year.and.Quarter", "Current.Assets.Total", "DRILL_TYPE", "DI.Oil.Prod.Quarter",  "DI.Gas.Prod.Quarter"), row.names = c(NA, -5L), class = c("data.table",  "data.frame"), sorted = c("Global.Company.Key",  "Calendar.Data.Year.and.Quarter"))

One example for a selection based with grepl:

dt[, grepl(glob2rx("Current.Assets*"), colnames(dt)), with = FALSE]

It would be nice, if something like this would be possible instead:

dt[, grepl(glob2rx("Current.Assets*")), with = FALSE]
Krantz
  • 1,424
  • 1
  • 12
  • 31
hannes101
  • 2,410
  • 1
  • 17
  • 40
  • can you specify what is `glob2rx`? – DJJ Mar 24 '17 at 12:54
  • `glob2rx` is just my way to get around regex, because it allows to use * as wildcard. – hannes101 Mar 24 '17 at 12:57
  • 2
    SO is for programming questions, while you're asking for a new feature. The FR is already posted, btw: https://github.com/Rdatatable/data.table/issues/1878 and https://github.com/Rdatatable/data.table/issues/1786 – Frank Mar 24 '17 at 13:12
  • I didn't find it, sorry. Should I remove the question or keep it as a reference to the feature request? – hannes101 Mar 24 '17 at 13:15
  • Up to you. I think it's fine either way. If you want to keep it as a ref, you could drop a comment on one of the issues like "please update / answer this SO post if this is implemented". – Frank Mar 24 '17 at 13:18
  • Am I missing something or [this](http://stackoverflow.com/a/40129395/1274242) would solve the issue? `dt[, .SD, .SDcols = names(dt) %like% "Current.Assets"]` – fridaymeetssunday Mar 24 '17 at 13:30
  • @fridaymeetssunday, my problem is pretty much the same as in https://github.com/Rdatatable/data.table/issues/1878#issuecomment-256406057 because in my case the `data.table` I am referring to is a pretty complicated subset, so the code becomes quite messy to say the least. – hannes101 Mar 24 '17 at 13:34

1 Answers1

3
library(data.table)

dt <- data.table(CurrentAssets=rnorm(10),FixedAssets=rnorm(10), CurrentLiabilities=rnorm(10),Capital=rnorm(10))

dt

##    CurrentAssets FixedAssets CurrentLiabilities    Capital
## 1:   -1.27610992  -0.2989316         0.20688252  0.6504636
## 2:    0.01065576   1.3088539         1.22533006  0.7550024
## 3:    0.53308022  -1.3459419        -0.99627142 -0.7589336
## 4:    0.30737237  -0.4291044         2.20328357  0.2157515
## 5:   -1.37391990   0.8581097        -0.08161687  0.7067757
## 6:    0.28664468   0.2308479         0.38675487 -0.3467660
## 7:   -0.22902454   1.3365470         0.10128697  0.3246363
## 8:    0.05159736  -2.0702850         0.78404464 -1.7612696
## 9:    0.51817847  -0.8365225        -0.04778573  0.6170114
##10:    0.50859575   0.5683021        -0.13780167 -0.9243434

Just some random columns. The accounts don't balance. You can define the columns, then do ...

colnames <- c("CurrentAssets","FixedAssets", "CurrentLiabilities","Capital")
dt[,.SD,.SDcols=grep("Assets",colnames,value =TRUE)]

If you don't want to type colnames and value=TRUE all the time you can build your own function like the following.

mygrep <- function(x){
    colnames <- c("CurrentAssets","FixedAssets", "CurrentLiabilities","Capital")
    grep(x,colnames,value=TRUE)
}

Now the drawback is of mygrep is that you need to put the column name manually. An improvement would be to pass the data.table to the function.

mygrep <- function(x,dt){
    colnames <- colnames(dt)
    grep(x,colnames,value=TRUE)
}

dt[,.SD,.SDcols=mygrep("Assets",dt)]

Edit Just found another way to do the same thing using macro in R. You will need the package gtools to use macros.

We define a macro subdt.

library(gtools)
  subdt <- defmacro(dt,pattern,expr={
   dt[,.SD,.SDcols=grep(pattern,colnames(dt),value=TRUE)]
    })

then do

subdt(dt,"Assets")

macros are powerful as they write the code before evaluation.

OliverHennhoefer
  • 677
  • 2
  • 8
  • 21
DJJ
  • 2,481
  • 2
  • 28
  • 53
  • Made my code much more readable, although it's not the final solution, probably the best available. – hannes101 Mar 24 '17 at 16:56
  • I wonder if @DJJ could provide a brief explanation why (example on mtcars) `mtcars_dt[,.SD,.SDcols = grep("m", colnames(mtcars_dt), value =TRUE)]` needs to be used for data.table while for `data.frame` the following works `mtcars[, grepl("m", colnames(mtcars))]` (but not for `data.table`)? This could helps others understand further the differences in data.table and data.frame. I recently tackled the same issue and your answer was the solution but I still wonder how it works in data.table and why base approach had to be adapted. – gofraidh Mar 10 '21 at 19:03
  • @gofraidh I would suggest posting a question on SO. There are people with more knowledge than me about data.table. If you have yet read it [here](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html#what-is-datatable-1a) is a good starting point. For data.table, a solution would be the following. `tmp <- grepl("m", colnames(aa)); aa[,..tmp]` – DJJ Mar 11 '21 at 20:29
  • 3
    Now, data.table has also the `patterns` key. `aa<- as.data.table(mtcars); aa[,.SD,.SDcols=patterns("m")]`. see `?data.table` for more details. – DJJ Mar 11 '21 at 20:36