43

I'm trying to get a handle on the ubiquitous which function. Until I started reading questions/answers on SO I never found the need for it. And I still don't.

As I understand it, which takes a Boolean vector and returns a weakly shorter vector containing the indices of the elements which were true:

> seq(10)
 [1]  1  2  3  4  5  6  7  8  9 10
> x <- seq(10)
> tf <- (x == 6 | x == 8)
> tf
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
> w <- which(tf)
> w
[1] 6 8

So why would I ever use which instead of just using the Boolean vector directly? I could maybe see some memory issues with huge vectors, since length(w) << length(tf), but that's hardly compelling. And there are some options in the help file which don't add much to my understanding of possible uses of this function. The examples in the help file aren't of much help either.

Edit for clarity: I understand that which returns the indices. My question is about two things: 1) why would you ever need to use the indices instead of just using the Boolean selector vector? and 2) what interesting behaviors of which might make it preferable to just using a vectorized Boolean comparison?

Ari B. Friedman
  • `library(fortunes); fortune(175)` – mdsumner Aug 03 '11 at 00:17
  • @mdsumner: Fair, but not the most substantive of examples ;-). You can actually give `fortune()` a which vector, but it just takes the first element: `if (length(which) > 1) which <- which[1]` – Ari B. Friedman Aug 03 '11 at 00:29
  • So did I. Thus the ;-)... :-) – Ari B. Friedman Aug 03 '11 at 00:53
  • I like discussions like this, great question. Always makes me think. I often use which even when I don't need to, because I like the idea of feeding in the vector indices rather than TRUE/FALSE for some reason. – nzcoops Aug 03 '11 at 01:33
  • I find the question strange, it is useful since I use it all the time! Sometimes you want the actual numbers, sometimes a logical vector the same length - an obvious one is so you can carry the "ID" to other sets of data as you subset and so on. Or, a logical vector is not very helpful when you want elements before and after a particular one. diff() on the which() values is also good for picking gaps and patterns etc. – mdsumner Aug 03 '11 at 03:39
  • @mdsumner It's the using it all the time that prompted the question--people seem to use it even when the (IMO more readable) boolean alternative works just as well. So I was wondering if there's something more going on here. The question has brought out quite a few neat tricks, including your `diff(which())`, so I'm quite glad I asked it. – Ari B. Friedman Aug 03 '11 at 09:37
  • I find it hard to understand how any serious use of R can happen without sometimes numbers sometimes booleans. I am tempted to ask back how you get by without it - though it is easy enough to replace for one offs with : and [. – mdsumner Aug 04 '11 at 10:22
  • @mdsumner: No one who's seen my work could call it serious. That's why they call those people who pay for stuff FUNders :-) – Ari B. Friedman Aug 04 '11 at 13:10
  • I struggled to choose that term, toyed with "long term" and the like - I find it fundamental is the point, as in you would give up in disgust using R if you were constantly trying to persist with boolean indexing exclusively. It's SO useful that I don't get the question here, though I appreciate exercising the idea to detail the contrast. – mdsumner Aug 04 '11 at 14:20
  • @mdsumner: I think it's an interesting technique, but I swear I haven't used it once yet. Now that I've seen some interesting tricks with it and, moreover, have an overriding philosophy (joran's use `which` when you "need to access elements whose positions are a function of the positions of other elements"), I'll likely use it a lot more. – Ari B. Friedman Aug 04 '11 at 16:30
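
The two position-based tricks mentioned in the comments above (`diff(which())` for gaps, and picking the elements before or after a match) can be sketched as follows. The vector `x` is made-up data for illustration only.

```r
# Made-up data: look for the value 7.
x <- c(3, 7, 7, 2, 9, 7, 1)

# Positions of the matches.
hits <- which(x == 7)

# Gaps between consecutive matches.
gaps <- diff(hits)

# Elements just before and just after each match, guarding the vector ends
# so that we never index position 0 or past the last element.
before <- x[hits[hits > 1] - 1]
after  <- x[hits[hits < length(x)] + 1]

hits
gaps
before
after
```

A logical vector alone cannot express "the element before each match" without first being turned into positions, which is the case joran describes.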

7 Answers

26

Okay, here is something where it proved useful last night:

In a given vector of values what is the index of the 3rd non-NA value?

> x <- c(1,NA,2,NA,3)
> which(!is.na(x))[3]
[1] 5

A little different from DWin's use, although I'd say his is compelling too!
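
The same idea generalizes to "the n-th element satisfying a condition". A minimal sketch, where `nth_match` is a hypothetical helper name and not a base R function:

```r
# Hypothetical helper: position of the n-th TRUE in a logical vector.
# Indexing past the end of which()'s result yields NA, so the helper
# returns NA when there are fewer than n TRUE values.
nth_match <- function(cond, n) {
  which(cond)[n]
}

x <- c(1, NA, 2, NA, 3)
third  <- nth_match(!is.na(x), 3)  # position of the 3rd non-NA value
fourth <- nth_match(!is.na(x), 4)  # there is no 4th non-NA value

third
fourth
```

Doing this with the logical vector alone would need something like a cumulative sum, so here the positions are the natural representation.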

jverzani
20

The title of the man page ?which provides a motivation. The title is:

Which indices are TRUE?

Which I interpret as being the function one might use if you want to know which elements of a logical vector are TRUE. This is inherently different to just using the logical vector itself. That would select the elements that are TRUE, not tell you which of them was TRUE.

Common use cases were to get the position of the maximum or minimum values in a vector:

> set.seed(2)
> x <- runif(10)
> which(x == max(x))
[1] 5
> which(x == min(x))
[1] 7

Those were so commonly used that which.max() and which.min() were created:

> which.max(x)
[1] 5
> which.min(x)
[1] 7

However, note that the specific forms are not exact replacements for the generic form. See ?which.min for details. One example is below:

> x <- c(4,1,1)
> which.min(x)
[1] 2
> which(x==min(x))
[1] 2 3
Ari B. Friedman
Gavin Simpson
  • Thanks for the answer. Added example at the end in an edit. Hope that's proper etiquette. If not, lemme know and I'll revert. – Ari B. Friedman Aug 02 '11 at 21:53
  • Here's still my issue with this. I've used `x[which.max(x)]` before and found it handy. But I found it handy because it saved typing `x[x==max(x)]`. In other words, because it was a shortcut, not because there was some reason I needed the *position* vs. the Boolean. What does having a vector of positions get you over a vector of Booleans? – Ari B. Friedman Aug 02 '11 at 21:56
  • @gsk3 One reason is the `NA` business that DWin mentioned. Consider the `x` from my example: `x[4] <- NA` then try `x[x == max(x, na.rm = TRUE)]`. Some people would consider the `NA` in the returned result ugly. – Gavin Simpson Aug 02 '11 at 22:04
  • Agreed. `na.rm` seems to be the major point in its favor. Although in special cases I can see actually needing the indices, as with the middle part of @daroczig's answer. – Ari B. Friedman Aug 02 '11 at 22:28
17

Two very compelling reasons not to forget which:

1) When you use "[" to extract from a dataframe, any calculation in the row position that results in NA will get a junk row returned. Using which removes the NAs. You can also use subset or %in%, which do not create the same problem.

> dfrm <- data.frame( a=sample(c(1:3, NA), 20, replace=TRUE), b=1:20)
> dfrm[dfrm$a >0, ]
      a  b
1     1  1
2     3  2
NA   NA NA
NA.1 NA NA
NA.2 NA NA
6     1  6
NA.3 NA NA
8     3  8
# Snipped  remaining rows

2) When you need the array indices (the arr.ind argument).
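
Because the data frame above is built with `sample()` and no seed, its output is not reproducible. A small deterministic sketch of reason 1, using made-up values, might look like this:

```r
# Small deterministic data frame (made-up values for illustration).
dfrm <- data.frame(a = c(1, NA, 3, NA, 2), b = 1:5)

# Logical indexing: every NA in the condition yields a junk all-NA row.
with_na <- dfrm[dfrm$a > 0, ]

# which() treats NA as "not TRUE", so those positions are dropped.
without_na <- dfrm[which(dfrm$a > 0), ]

with_na
without_na
```

Whether dropping the NA rows silently is desirable depends on the analysis; the sketch only shows the difference in behavior.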

IRTFM
  • Interesting. I'm not totally convinced that not returning NA is always an advantage though. Seems better to return NA and force the user to think about what they really want to do with the NAs, then use which if they want. – Ari B. Friedman Aug 02 '11 at 21:48
  • Yeah. I've heard people make that argument. But. It's a real hassle to have your console fill up with thousands of NA's when your 4 million record data.frame has 1-2% NA's and you are really only interested in the couple of hundred with real values in that column that meet your search logic. – IRTFM Aug 02 '11 at 22:19
  • Good point. Still not enough for me to give up the clarity of Boolean comparison though, except in special cases. – Ari B. Friedman Aug 02 '11 at 22:26
12

which can be useful (it saves both computing and typing effort) if, for example, you have to filter the rows of a data frame or matrix by a given variable/column and update other variables/columns based on that. Example:

df <- mtcars

Instead of:

df$gear[df$hp > 150] <- mean(df$gear[df$hp > 150])

You could do:

p <- which(df$hp > 150)
df$gear[p] <- mean(df$gear[p])

A further case is when you have to filter already-filtered elements in a way that cannot be done with a simple & or |, e.g. when you have to update some parts of a data frame based on other data tables. In that case you need to store (at least temporarily) the indices of the filtered elements.
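
As a rough sketch of filtering an already-filtered set: the second step below is positional (the three best mpg values within the first subset), so it cannot be written as a single & or | condition. The choice of columns and the cutoff of three are arbitrary assumptions for illustration.

```r
df <- mtcars

# First filter: row positions of cars with more than 150 hp.
p <- which(df$hp > 150)

# Second filter, computed only within the first subset: the three cars
# with the highest mpg. order() returns positions relative to p, and
# indexing p with them maps back to row positions in df.
q <- p[order(df$mpg[p], decreasing = TRUE)[1:3]]

# q can now be used on df itself, or on any other table aligned
# row-for-row with df.
rownames(df)[q]
```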

Another case that comes to mind is when you have to loop through part of a data frame/matrix, or do some other kind of transformation that requires knowing the indices of several cases. Example:

> urban <- which(USArrests$UrbanPop > 80)
> USArrests[urban, ] - USArrests[urban-1, ]
              Murder Assault UrbanPop  Rape
California       0.2      86       41  21.1
Hawaii         -12.1    -165       23  -5.6
Illinois         7.8     129       29   9.8
Massachusetts   -6.9    -151       18 -11.5
Nevada           7.9     150       19  29.5
New Jersey       5.3     102       33   9.3
New York        -0.3     -31       16  -6.0
Rhode Island    -2.9      68       15  -6.6

Sorry for the dummy examples; I know it does not make much sense to compare the most urbanized US states with the states just before them in the alphabet, but I hope it illustrates the point :)

Checking out which.min and which.max also gives some clues, as they save a lot of typing. Example:

> row.names(mtcars)[which.max(mtcars$hp)]
[1] "Maserati Bora"
daroczig
  • But you can store the results of a comparison also: `p <- ( df$hp > 150 )`. No parens necessary; I just like them for clarity. I do like the use of `urban-1` though. Can see a place where maybe everything's alphabetized and you want the previous entry, or it's sorted in ascending order and you want the previous entry, or something like that. – Ari B. Friedman Aug 02 '11 at 22:25
  • +1 for the `urban - 1` example (although it's a very simple case). @gsk3 There will be cases where you need to access elements whose positions are a *function* of the *positions* of other elements. The boolean vector doesn't help much there. Of course, how *often* that type of situation arises will depend on what you're doing. ;) – joran Aug 03 '11 at 00:08
11

Well, I found one possible reason. At first I thought it might be the useNames argument, but it turns out that simple Boolean selection does that too.

However, if your object of interest is a matrix, you can use the arr.ind argument to return the result as (row, column) pairs:

> x <- matrix(seq(10),ncol=2)
> x
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
> which((x == 6 | x == 8),arr.ind=TRUE)
     row col
[1,]   1   2
[2,]   3   2
> which((x == 6 | x == 8))
[1] 6 8

That's a handy trick to know about, but hardly seems to justify its constant use.
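
One thing the (row, column) form allows, sketched here on the same matrix, is feeding the result straight back into `[`: base R accepts a two-column matrix of positions as an index. The variable name `pos` is just for illustration.

```r
x <- matrix(seq(10), ncol = 2)

# (row, col) pairs of the matching cells.
pos <- which(x == 6 | x == 8, arr.ind = TRUE)

# A two-column index matrix selects one cell per row of pos.
vals <- x[pos]

# The pairs also answer questions the flat indices do not,
# such as which columns contain a match.
match_cols <- unique(pos[, "col"])

vals
match_cols
```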

Ari B. Friedman
7

Surprised no one has answered this: how about memory efficiency?

If you have a long vector of very sparse TRUE's, then keeping track of only the indices of the TRUE values will probably be much more compact.
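
A quick way to see the difference is `object.size()`. The sketch below assumes a made-up vector of one million elements with ten TRUE values; exact byte counts depend on the platform, so none are quoted.

```r
# A long, sparse logical vector: one million elements, ten of them TRUE.
tf <- logical(1e6)
tf[seq(1, 1e6, by = 1e5)] <- TRUE

# The same information as ten integer positions.
idx <- which(tf)

# Compare the memory footprint of the two representations.
object.size(tf)
object.size(idx)

# Both select exactly the same elements.
x <- seq_along(tf)
identical(x[tf], x[idx])
```

The saving only materializes if the logical vector itself can be discarded after `which()` is called; the comparison that creates it still allocates the full-length vector.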

Nick Sabbe
  • Some of the comments have addressed *computational* efficiency, but you're, I believe, the first to mention memory efficiency, which is also pretty important on large datasets (and irrelevant for small ones :-) ). – Ari B. Friedman Oct 05 '11 at 10:33
4

I use it quiet often in data exploration. For example if I have a dataset of kids data and see from summary that the max age is 23 (and should be 18), I might go:

sum(dat$age>18)

If that was 67, and I wanted to look closer I might use:

dat[which(dat$age>18)[1:10], ]

Also useful if you're making a presentation and want to pull out a snippet of data to demonstrate a certain oddity or what not.

nzcoops
  • `head(dat[dat$age>18,])` is generally how I would do that, although on a test data.frame I just played with it was twice as slow. – Ari B. Friedman Aug 03 '11 at 09:28
  • I think you've answered your own question there gsk3. There are very few things that can't be done in multiple ways in R. I'm generally working with datasets with >1mil rows, so if I used things like `head(dat[dat$age>18,])` I would get nowhere fast. – nzcoops Aug 05 '11 at 00:40
  • Yeah. I've only done one project in R so far with what I'd consider big data, and it was fairly slow. I'll definitely add `which` to my bag of tricks for speed optimization.... – Ari B. Friedman Aug 05 '11 at 06:54
  • The "quiet" you're looking for is "quite." :) – Abe Jan 10 '17 at 00:02