Selecting non-consecutive columns by name in data.table r

Question

I have database on my disk and I'd like to select multiple columns using : occasionally within the data.table using column names.

Previous answers only include using indices for column selection which is not desirable for my case.

The example would be like below:

library(gapminder)
data(gapminder)
setDT(gapminder)

names(gapminder) # [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

# I would like to select columns from `country` to `year` and pop

gapminder[,country:year] # this one works



gapminder[,country:year + pop] # doesn't work
gapminder[,c(country:year,pop)] # doesn't work either

gapminder[,.SD, .SDcols = c(country:year,pop)] # doesn't work

I'm scratching my head on this. I'd appreciate any suggestions.

score 3 · Answer 1 · answered May 08 '20 at 01:20

I am not sure if there is really an easy solution to this in data.table but perhaps you could cbind the range of columns with individual column name.

library(data.table)
cbind(gapminder[,country:year], gapminder[, 'pop'])

However, your desired behavior is possible in dplyr.

library(dplyr)
gapminder %>% select(country:year, pop)


#       country continent year      pop
#1: Afghanistan      Asia 1952  8425333
#2: Afghanistan      Asia 1957  9240934
#3: Afghanistan      Asia 1962 10267083
#4: Afghanistan      Asia 1967 11537966
#5: Afghanistan      Asia 1972 13079460
#6: Afghanistan      Asia 1977 14880372

chinsoon12 · Accepted Answer · 2020-05-08T04:05:34.907

3

another option:

gapminder[, c(.SD, .(pop=pop)), .SDcols=country:year]

Or if you have more columns,

cols <- setNames(c("pop", "lifeExp"), c("pop", "lifeExp"))
gapminder[, c(.SD, mget(cols)), .SDcols=country:year]

output:

          country continent year      pop lifeExp
   1: Afghanistan      Asia 1952  8425333  28.801
   2: Afghanistan      Asia 1957  9240934  30.332
   3: Afghanistan      Asia 1962 10267083  31.997
   4: Afghanistan      Asia 1967 11537966  34.020
   5: Afghanistan      Asia 1972 13079460  36.088
  ---                                            
1700:    Zimbabwe    Africa 1987  9216418  62.351
1701:    Zimbabwe    Africa 1992 10704340  60.377
1702:    Zimbabwe    Africa 1997 11404948  46.809
1703:    Zimbabwe    Africa 2002 11926563  39.989
1704:    Zimbabwe    Africa 2007 12311143  43.487

edited May 08 '20 at 04:05

answered May 08 '20 at 01:32

chinsoon12

25,005
4
25
35

This method adds an arbitrary name, V4 for pop. It makes me hesitant to use this approach – Matthew Son May 08 '20 at 04:03
@MatthewSon, have addressed your concern in the update – chinsoon12 May 08 '20 at 04:05
Quick fix seems reasonably concise, but is there any way I can use multiple colons `:` in data.table? I mean, like choosing (a:c) and (e:g). – Matthew Son May 08 '20 at 04:28
prob not inside `[`, but you can write a function to get the columns before passing into `.SDcols` – chinsoon12 May 08 '20 at 04:48

jedmatic · Answer 3 · 2021-04-30T15:51:57.990

UPDATE

After consulting the data.table FAQs, specifically this part

1.2 Why does DT[,"region"] return a 1-column data.table rather than a vector? See the answer above. Try DT$region instead. Or DT[["region"]].

1.3 Why does DT[, region] return a vector for the “region” column? I’d like a 1-column data.table. Try DT[ , .(region)] instead. .() is an alias for list() and ensures a data.table is returned.

I realized that there was an even simpler solution.

In order to use cbind that preserves the column names, you need to pass two data tables. The issue with the column getting named V4 is because you're passing a vector to cbind.

But you can control whether data.table returns a vector or a 1-column data.table. Here's what that looks like in your case:

newest_gapminder2 <- cbind(gapminder[, country:year], gapminder[, 'pop'])

or

newest_gapminder3 <- cbind(gapminder[, country:year], gapminder[, .(pop)])

ORIGINAL RESPONSE

I found your question because I had the same one! I wanted to create a subset of a data table without needing to list every single column.

I tried a few different things and found something that I can live with...

## create a data table for this example
dt <- data.table("col1"=1:5, "col2"=2:6, "col3"=letters[2:6], "col4"=letters[1:5], "col5"=3:7)
dim(dt)
dt

## the goal is to create a subset of this data frame that contains col1, col3, col4, and col5

Sorry I didn't use your data. This should work the same though.

# method 1
## subset out a vector and give the column name
col1 <- dt[, col1]

## use cbind on the object and the data table subset
## the object name takes the place of the column name in the table
new_dt <- cbind(col1, dt[, col3:col5])

## check that the result is a data.table
class(new_dt)
dim(new_dt)
new_dt

That works out OK but it feels a little hack-y. The issue I was running into trying something like this:

dt_alt <- cbind(dt[, col1], dt[, col3:col5])

but dt[ , col1] creates a vector not a data table, and when it gets coerced in cbind, the resulting name is V1. So I thought, maybe it'd be easier to avoid binding single columns, and then just drop the unneeded columns after the fact.

# method 2
## take two different subsets/slices and cbind them 
new_dt2 <- cbind(dt[, col1:col2], dt[, col3:col5])

## take out col2
new_dt2[, col2 := NULL]

class(new_dt2)
dim(new_dt2)
new_dt2

That was a little bit better but then I was wondering about something even more streamlined. I thought about chaining in data.table and wanted to combine it with method 2. I want to credit this post for the := NULL technique.

# method 3
## thinking about how data.table works, can the := NULL be chained? 
## spoiler: it can!
## this feels like kind of a hack but...
new_dt3 <-cbind(dt[,col1:col2][, col2:=NULL], dt[,col3:col5])

class(new_dt3)
dim(new_dt3)
new_dt3

OK, after all that, I felt badly that I didn't use the gapminder data from your question, so here is my method #3 applied to your data:

gapminder <- cbind(gapminder[, country:year]), gapminder[, pop:gdpPercap][, gdpPercap := NULL])

I timed it using the technique described here.

   user  system elapsed 
   0.00    0.00    0.02

All three techniques were comparable. I'm not sure how this would perform on a multi-GB dataset, though.

Selecting non-consecutive columns by name in data.table r

3 Answers3