Consider removing the elements of a vector that match a certain set of criteria. The expected behaviour is to remove those that match and, in particular, to remove none if none match:

> d = 1:20
> d
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> d[-which(d > 10)]
 [1]  1  2  3  4  5  6  7  8  9 10
> d[-which(d > 100)]
integer(0)

We see here that the final statement has both done something very unexpected and silently hidden the error without even a warning.

I initially thought that this was an undesirable (but consistent) consequence of the choice that an empty index selects all elements of a vector

http://stat.ethz.ch/R-manual/R-devel/library/base/html/Extract.html

as is commonly used to e.g. select the first column of a matrix, m, by writing

m[ , 1]

However the behaviour observed here is consistent with interpreting an empty vector as "no elements", not "all elements":

> a = integer(0)

selecting "no elements" works exactly as expected:

> v[a]
numeric(0)

however removing "no elements" does not:

> v[-a]
numeric(0)

For an empty vector to both select no elements and remove all elements requires inconsistency.

Obviously it is possible to work around this issue, either by checking that which() returns a result of non-zero length or by using a logical expression, as covered in "In R, why does deleting rows or cols by empty index results in empty data? Or, what's the 'right' way to delete?"

but my two questions are:

  1. Why is the behaviour inconsistent?
  2. Why does it silently do the wrong thing without an error or warning?
user2711915
  • If which() doesn't find any match, it returns integer(0); therefore, if you try to subtract the 0 element of the vector, it calls the 0 element of the vector, since -0 = 0 in R – Nico Coallier May 17 '17 at 14:20
  • I try to always use logical vectors for subsetting for this reason – talat May 17 '17 at 14:28
  • If you want to take a deeper dive into exactly how and why the subset primitive behaves the way it does, and you're comfortable with C, you can look at the source here: https://github.com/wch/r-source/blob/trunk/src/main/subset.c But I think @Patronus's answer is a pretty good one. – Empiromancer May 17 '17 at 14:30
  • I don't think going into the C source can shed any more light. The issue is just order of operations as Patronus effectively says. I think Nico Coallier above is almost right too, except that I believe integer(0) is not a single integer with value zero, it is an integer vector of length zero. It is not that -0 = 0 (which is incidentally true, but irrelevant) but that an empty vector is unchanged by negation. – user2711915 May 17 '17 at 14:44

1 Answer

This doesn't work because which(d > 100) and -which(d > 100) are identical: there is no difference between an empty vector and the negative of that empty vector.

For example, imagine you did:

d = 1:10

indexer = which(d > 100)
negative_indexer = -indexer

The two variables would be the same (which is the only consistent behavior: negating every element of an empty vector leaves it unchanged, since it has no elements).

indexer
#> integer(0)
negative_indexer
#> integer(0)
identical(indexer, negative_indexer)
#> [1] TRUE

At that point, you couldn't expect d[indexer] and d[negative_indexer] to give different results. There is also no place to provide an error or warning: it doesn't know when passed an empty vector that you "meant" the negative version of that empty vector.


The solution is that for subsetting you don't need which() at all: a logical vector can be used as the index directly, e.g. d[d > 10] to select. For your negative indexing you could therefore use d[!(d > 100)] or d[d <= 100]. This behaves as you'd expect because d > 10 and !(d > 100) are logical vectors rather than vectors of indices.
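
As a minimal sketch of the two workarounds mentioned in the question and in this answer, the following compares index-based and logical removal when nothing matches. The helper name `drop_matching` is hypothetical (it is not part of base R) and only illustrates the length check.

```r
d <- 1:20

# Index-based removal silently returns nothing when no element matches,
# because -integer(0) is still integer(0).
idx <- which(d > 100)
n_index <- length(d[-idx])      # 0

# Logical removal keeps every element when no element matches.
kept <- d[!(d > 100)]
n_logical <- length(kept)       # 20

# Hypothetical helper: guard the index-based form by checking the
# length of the index before negating it.
drop_matching <- function(x, idx) {
  if (length(idx) == 0L) x else x[-idx]
}

none_removed <- drop_matching(d, which(d > 100))   # all of d
some_removed <- drop_matching(d, which(d > 10))    # 1 to 10
```

The logical form needs no special case, which is probably why it is the more common recommendation; the guarded helper is only useful when the indices already exist as integers.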

David Robinson
  • Ok, I think you've got to the problem there: even if you explicitly specify "-indexer", then order of operations means that the Extract operation only sees "indexer", because the negation is applied before the indexing gets to see it. Good general point on the logical selection rather than index selection. I was intending the which to be more of a motivating example than the whole problem. – user2711915 May 17 '17 at 14:40
  • On further inspection, it appears that R will not allow you to mix positive and negative subscripts i.e. v[c(1,-2)] is illegal (and probably meaningless - give only the first element and not the second?) so the interpreter could interpret explicitly writing v[-a] differently to v[a], and there is no situation where you could need to have pre-negated only some of the elements of a. However this just pushes the problem one step back, as b = -a, v[b] will still perform unexpectedly. – user2711915 May 17 '17 at 15:03
  • @user2711915 Yep, and that would also violate [referential transparency](http://stackoverflow.com/questions/210835/what-is-referential-transparency). Due to non-standard evaluation a number of things in R do violate referential transparency (e.g. `subset(mtcars, wt > 3)` works but `b <- wt > 3; subset(mtcars, b)` does not), but for something as simple as vector indexing the language chooses not to, and it's probably the right choice. – David Robinson May 17 '17 at 15:48
  • @user2711915 regarding this being a motivating example: I'd actually say this is one of the reasons I avoid using `which()` to index except when it's necessary! Logical indexing avoids odd exceptions like these. – David Robinson May 17 '17 at 15:51
  • Having used which and logicals in the past, I think I'll move to using which probably never in the future at least for selecting. This question covers more: http://stackoverflow.com/questions/6918657/whats-the-use-of-which – user2711915 May 17 '17 at 16:56
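
The points raised in the comments above can be checked directly. This is only an illustrative sketch: the vector `v` and the `setdiff()` alternative are assumptions added here, not something proposed in the thread.

```r
v <- c(10, 20, 30)

# Mixing positive and negative subscripts is an error, so it is
# caught here rather than allowed to stop the script.
res <- tryCatch(v[c(1, -2)], error = function(e) "error")

# Negation is applied before `[` sees the index, so a pre-negated
# empty index behaves exactly like the empty index itself.
a <- integer(0)
b <- -a
same_result <- identical(v[b], v[a])

# One index-based alternative that handles the empty case: compute
# the positions to keep instead of negating the positions to drop.
keep <- setdiff(seq_along(v), a)
all_kept <- identical(v[keep], v)
```

Here `res` holds the string "error", and both `same_result` and `all_kept` are TRUE.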