Why is [- subsetting (i.e. deletion) of columns not possible with names?

Question

I fear greatly that this has been asked and will be downvoted, but I have not found the answer in the docs (?"["), and discovered that it is hard to search for.

data(wines)
# This is allowed:
alcoholic <- wines[, 1]
alcoholic <- wines[, "alcohol"]
nonalcoholic <- wines[, -1]
# But this is not:
fail <- wines[, -"alcohol"]

I know of two solutions, but I am frustrated that I need them.

win <- wines[, !colnames(wines) %in% "alcohol"]  # snappy
win <- wines[, -which(colnames(wines) %in% "alcohol")]  # snappier!

Are `snappy` and `snappier` positive or negative measures? I prefer `setdiff` in these cases. What do you expect `-"alcohol"` to return? It doesn't work as a command by itself, so why would it work when trying to subset? — A5C1D2H2I1M1N2O1R2T1, Sep 05 '13 at 10:50
Maybe not an answer to your "Why" in terms of "why has someone chosen to implement it this way", but anyway, from `?[`: "For `[`-indexing only: i, j, ... can be logical vectors" (your `!` alternative) [...] "can also be negative integers" (your `which` alternative). — Henrik, Sep 05 '13 at 10:57
@AnandaMahto I was being sarcastic, so negative connotations. Expectations of anything in R? I have very few expectations after even my little experience with it :) (That was humour). Can you give an example of how `setdiff` would handle this case? — a different ben, Sep 05 '13 at 10:58
if you're just looking for something shorter: `wines[names(wines)!="alcohol"]` — plannapus, Sep 05 '13 at 11:00
@plannapus Thanks, that's the shortest! Only good for one name though isn't it? I would need to use %in% for a list of names I think. — a different ben, Sep 05 '13 at 11:22
Where does the wines data set come from? I get 'not found' (R 2.15, so maybe it's new). — Spacedman, Sep 05 '13 at 13:37
@adifferentben yes indeed you would. For a vector of names it would become `wines[!names(wines)%in%c(...)]`. — plannapus, Sep 05 '13 at 13:37
Filling in a useful link from a now-deleted answer by Dieter Menne about a response from Brian Ripley on this topic on the R mailing list: http://r-project.markmail.org/thread/sdg7mopk4towqbm4 — Ben Bolker, Sep 05 '13 at 16:40
Or you can just delete the column by reference if `wines` were a `data.table`: `wines[,alcohol:=NULL]`. That's instant no matter how big the data is. So if the data is large it's more efficient than copying every column other than the one you want to delete. If not it doesn't matter really. — Matt Dowle, Sep 05 '13 at 17:02
@Spacedman, the wines data set is in the `kohonen` package, and maybe a few others? It's a classic for machine learning examples. You could also get it at the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/ — a different ben, Sep 06 '13 at 05:13

score 19 · Accepted Answer · answered Sep 05 '13 at 10:58


When you do

wines[, -1]

-1 is evaluated before it is used by [. As you know, the unary - operator won't work with objects of class character, so doing the same with "alcohol" will lead you to:

Error in -"alcohol" : invalid argument to unary operator
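
A minimal sketch of that evaluation order, using a small made-up data frame `df` (an assumption, standing in for `wines`): the index argument is computed first, and only its value reaches `[`.

```r
# Toy stand-in for `wines`; the column names are illustrative only.
df <- data.frame(alcohol = c(12.1, 13.4), ash = c(2.1, 2.4), hue = c(1.0, 0.9))

idx <- -1          # unary minus on a number evaluates fine
names(df[, idx])   # same as df[, -1]: every column except the first

# Unary minus on a character string fails on its own,
# so there is never a value to hand to `[`.
msg <- tryCatch(-"alcohol", error = function(e) conditionMessage(e))
msg
```

The error is raised while the argument is being computed, which is why it does not mention `[` at all.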

You can add the following to your alternatives:

wines[, -match("alcohol", colnames(wines))]
wines[, setdiff(colnames(wines), "alcohol")]

but you should know about the risks of negative indexing: see, for example, what happens if you mistype the name as "alcool". So your first suggestion and the last one here (@Ananda's) should be preferred. You might also want to write a function that will error out if you provide a name that is not part of your data.
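
To make the risk concrete, here is a sketch on a made-up data frame `df` (not the real `wines` data), followed by a hypothetical helper, `drop_cols_strict`, that stops when a requested name is missing. The helper's name and behaviour are assumptions for illustration only.

```r
df <- data.frame(alcohol = c(12.1, 13.4), ash = c(2.1, 2.4), hue = c(1.0, 0.9))

# Mistyped name with -which(): which() returns integer(0),
# so the negative index selects nothing and every column is lost.
ncol(df[, -which(colnames(df) %in% "alcool")])   # 0

# Mistyped name with setdiff(): nothing is removed, all columns survive.
ncol(df[, setdiff(colnames(df), "alcool")])      # 3

# Hypothetical helper: refuse to continue if any name is not a column of `data`.
drop_cols_strict <- function(data, cols) {
  missing_cols <- setdiff(cols, colnames(data))
  if (length(missing_cols) > 0) {
    stop("not column(s) of data: ", paste(missing_cols, collapse = ", "))
  }
  data[, setdiff(colnames(data), cols), drop = FALSE]
}

names(drop_cols_strict(df, "alcohol"))   # "ash" "hue"
```

Calling `drop_cols_strict(df, "alcool")` stops with an error instead of silently returning the wrong columns.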

answered Sep 05 '13 at 10:58

flodel


`R> -1` gives `[1] -1`, so how does that work? I am not so familiar with the way R works. Is that what you mean? – a different ben Sep 05 '13 at 11:05
I'll have to write a compendium of idioms for deleting a column, thanks for the additions :) – a different ben Sep 05 '13 at 11:07
Yes, `-1` is something that evaluates fine, so you can pass it as an argument to the `[` function and it will know what to do with it. On the other side, `-"alcohol"` does not. It has less to do with how `[` is implemented, more with the fact that you cannot compute `-"alcohol"`, hence pass it to `[` or any function. – flodel Sep 05 '13 at 11:10
I normally answer these types of questions with `-which()` is evil, and then point the way to `setdiff`. +1 – A5C1D2H2I1M1N2O1R2T1 Sep 05 '13 at 11:16
Forgive me I was a little slow in getting your meaning. Thanks @Ananda and flodel. – a different ben Sep 05 '13 at 11:17

For real fun, compare `foo[-0]` and `foo[-c(0,1)]`. IIRC flodel discussed zeroes in a SO question a few months back. – Carl Witthoft Sep 05 '13 at 11:29
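
For anyone curious, a quick sketch of that comparison with a made-up named vector `foo`: `-0` is just `0`, which selects nothing, whereas a zero mixed in with negative indices is ignored.

```r
foo <- c(a = 10, b = 20, c = 30)

foo[-0]         # -0 is 0, so this selects nothing: a zero-length vector
foo[-c(0, 1)]   # index is c(0, -1): the zero is ignored, the first element is dropped
```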

Ben Bolker · Answer 2 · 2013-09-05T17:22:41.010

9

Another possibility:

subset(wines,select=-alcohol)

You can even do

subset(wines,select=-c(alcohol,other_drop))

In fact, if you have a contiguous set of columns you want to drop, you can even

subset(wines,select=-(first_drop:last_drop))

which can be handy (although IMO it depends dangerously on the order of columns, which is something that might be fragile: I might prefer a grep-based solution if there were some way to identify columns, or a more explicit separate definition of column groups).

In this case subset is using non-standard evaluation, which as has been discussed elsewhere can be dangerous in some contexts. But I still like it for simple, top-level data manipulation because of its readability.

edited Sep 05 '13 at 17:22

answered Sep 05 '13 at 12:59

Ben Bolker



The subset function converts the select expression to numbers via a named vector of numbers, which is why the ":" method works. – IRTFM Sep 05 '13 at 17:30
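
A rough sketch of that mechanism on a made-up data frame `df`: each column name is mapped to its position, and the `select` expression is then evaluated against that mapping, so `-alcohol` becomes `-1` and `ash:hue` becomes `2:3`. This mirrors the idea only; see the source of `subset.data.frame` for the actual implementation.

```r
df <- data.frame(alcohol = c(12.1, 13.4), ash = c(2.1, 2.4), hue = c(1.0, 0.9))

# Map each column name to its position: list(alcohol = 1, ash = 2, hue = 3)
positions <- as.list(seq_along(df))
names(positions) <- names(df)

# Evaluate unevaluated select expressions with column names bound to positions.
eval(quote(-alcohol), positions)       # -1
eval(quote(ash:hue), positions)        # 2 3

names(subset(df, select = -alcohol))   # "ash" "hue"
```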

@DWin, `?subset` says `This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.` Why? What are the non-standard evaluations it refers to? The ones Ben Bolker has listed? – a different ben Sep 06 '13 at 00:26

IRTFM · Answer 3 · 2013-09-16T15:53:58.590

6

Another method that uses numeric indexing and generalizes to situations where you want to remove a bunch of similarly named columns:

dfrm[ , -grep("^val", names(dfrm) )] #remove columns starting with "val"
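
One caveat, sketched on a made-up data frame `dfrm`: when the pattern matches nothing, `grep` returns `integer(0)` and the negative index drops every column. A logical index built with `grepl` avoids that.

```r
dfrm <- data.frame(id = 1:2, val_a = c(0.1, 0.2), val_b = c(0.3, 0.4))

# Pattern matches two columns: only "id" is left.
names(dfrm[, -grep("^val", names(dfrm)), drop = FALSE])   # "id"

# No column starts with "zzz": the negative-index form loses all columns...
ncol(dfrm[, -grep("^zzz", names(dfrm)), drop = FALSE])    # 0

# ...while the logical form keeps them all.
ncol(dfrm[, !grepl("^zzz", names(dfrm)), drop = FALSE])   # 3
```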

(I gave my vote to flodel, since his answer described "why" a "minus sign" didn't work: essentially because the R authors didn't overload the "-" operator for that purpose. They also didn't overload "+" to do concatenation in the manner that some languages did.)

edited Sep 16 '13 at 15:53

answered Sep 05 '13 at 11:18

IRTFM


So they *could* be overloaded if the devs so chose? – a different ben Sep 05 '13 at 11:19
@adifferentben Or you could overload the ops yourself if you dare :-). – Carl Witthoft Sep 05 '13 at 11:27
The authors of the lattice and ggplot2 plotting systems have overloaded the "+" operator, so there are no fundamental barriers to overloading "-". – IRTFM Sep 05 '13 at 17:08

score 3 · Answer 4 · answered Sep 05 '13 at 10:59

How about writing a simple little function and sticking it in your .Rprofile? Something like...

dropcols <- function( df , cols ){
  # drop = FALSE keeps the result a data.frame even if only one column is left
  out <- df[ , !names(df) %in% cols , drop = FALSE ]
  return( out )
}

#  To use it....
data( mtcars )
head( dropcols( mtcars , "mpg" ) )
#                  cyl disp  hp drat    wt  qsec vs am gear carb
#Mazda RX4           6  160 110 3.90 2.620 16.46  0  1    4    4
#Mazda RX4 Wag       6  160 110 3.90 2.875 17.02  0  1    4    4
#Datsun 710          4  108  93 3.85 2.320 18.61  1  1    4    1
#Hornet 4 Drive      6  258 110 3.08 3.215 19.44  1  0    3    1
#Hornet Sportabout   8  360 175 3.15 3.440 17.02  0  0    3    2
#Valiant             6  225 105 2.76 3.460 20.22  1  0    3    1

Yep, that's a useful way to solve it. Not very portable however for others so I'm a little disinclined to do it. I've avoided that sort of thing in general partly for that reason, but also I always forget to sync my work to home machine to laptop, etc, and forget what's in my .Rprofile anyway! — a different ben, Sep 05 '13 at 11:02

score 3 · Answer 5 · edited Sep 05 '13 at 17:08

I can't find this in the documentation, but the following syntax works with data.table:

library(data.table)
dt = data.table(wines)

dt[, !"alcohol", with = FALSE]

And you can also give a vector of column names if you like:

dt[, !c("Country", "alcohol"), with = FALSE]

It was just documented in NEWS for v1.8.4 it seems :

When with=FALSE, "!" may also be a prefix on j, #1384ii. This selects all but the named columns.

DF[,-match("somecol",names(DF))]
# works when somecol exists. If not, NA causes an error.

DF[,-match("somecol",names(DF),nomatch=0)]
# works when somecol exists. Empty data.frame when it doesn't, silently.

DT[,-match("somecol",names(DT)),with=FALSE]
# same issues.

DT[,setdiff(names(DT),"somecol"),with=FALSE]
# works but you have to know order of arguments, and no warning if missing

vs

DT[,!"somecol",with=FALSE]
# works and easy to read. With (helpful) warning if somecol isn't there.

But the above all copy every column other than the deleted one. More usually :

DT[,somecol:=NULL]

to delete the column by name by reference.

score 0 · Answer 6 · answered Sep 05 '13 at 10:58


You can get your desired behavior as follows:

data(iris)
str(iris)
delete <- which(colnames(iris) == "Species")
iris2 <- iris[, -delete]
str(iris2)

answered Sep 05 '13 at 10:58

Bryan Hanson


This is equivalent to matching a single string, as opposed to using `%in%` to match a list of strings. – a different ben Sep 05 '13 at 11:25
This could be simplified to `deleted <- colnames(iris) == "Species"; iris[!deleted]`. You don't need negative indexing when you have a logical vector. – Marek Mar 19 '14 at 06:26
