When should I use "which" for subsetting?

Question

Here is a toy example.

 iris %>% 
  group_by(Species) %>% 
  summarise(max = Sepal.Width[Sepal.Length == max(Sepal.Length)])

 # A tibble: 3 x 2
  Species      max
  <fct>      <dbl>
1 setosa       4  
2 versicolor   3.2
3 virginica    3.8

It gives the same output when using which().

iris %>% 
  group_by(Species) %>% 
  summarise(max = Sepal.Width[which(Sepal.Length == max(Sepal.Length))])
# summarise(max = Sepal.Width[which.max(Sepal.Length)])

# A tibble: 3 x 2
  Species      max
  <fct>      <dbl>
1 setosa       4  
2 versicolor   3.2
3 virginica    3.8

help(which) says:

Give the TRUE indices of a logical object, allowing for array indices.

`==` seems to do the same thing: it returns TRUE and FALSE.

So when is which() useful for subsetting?

Related question/answers with other uses for `which`: https://stackoverflow.com/questions/6918657/whats-the-use-of-which — Mikko Marttila, Aug 19 '18 at 08:30
`==` returns a logical, `which` returns an integer, some functions accept both inputs and in this case they are often equivalent (answers explain exceptions), but you shouldn't expect it to be always the case. — moodymudskipper, Aug 19 '18 at 19:19

score 11 · Answer 1 · answered Aug 19 '18 at 03:54

When `==` ends up returning NA. Compare (1:2)[which(c(TRUE, NA))] with (1:2)[c(TRUE, NA)].

If the NA is not removed, indexing by NA gives NA (see ?Extract). However, the removal cannot be done with na.omit, because dropping elements shifts the positions of the remaining TRUE values, so you may select the wrong elements. A safe way is to replace NA with FALSE and then index. But why not just use which?
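A minimal sketch of the four options mentioned above, using made-up vectors (`vals`, `x`, and `keep` are illustrative names, not from the question):

```r
vals <- c("a", "b", "c")
x    <- c(5, NA, 7)
keep <- x == 7                       # FALSE, NA, TRUE

by_logical <- vals[keep]             # NA "c": the NA index yields an NA element
by_which   <- vals[which(keep)]      # "c": which() keeps only the TRUE positions
by_naomit  <- vals[na.omit(keep)]    # "b": the index shrinks to length 2 and is recycled
by_replace <- vals[replace(keep, is.na(keep), FALSE)]  # "c": NA replaced by FALSE
```

Here `na.omit()` silently picks the wrong element because the shortened logical index no longer lines up with `vals`.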

score 3 · Accepted Answer · answered Aug 19 '18 at 09:19

Since this question is specifically about subsetting, I thought I would illustrate some of the performance benefits of using which() over a logical subset brought up in the linked question.

When you want to extract the entire subset, there is not much difference in processing speed, but which() needs to allocate less memory. However, if you only want part of the subset (e.g. to showcase some strange findings), which() has a significant speed and memory advantage, because you can subset the result of which() and so avoid subsetting the data frame twice.

Here are the benchmarks:

df <- ggplot2::diamonds; dim(df)
#> [1] 53940    10
mu <- mean(df$price)

bench::press(
  n = c(sum(df$price > mu), 10),
  {
    i <- seq_len(n)
    bench::mark(
      logical = df[df$price > mu, ][i, ],
      which_1 = df[which(df$price > mu), ][i, ],
      which_2 = df[which(df$price > mu)[i], ]
    )
  }
)
#> Running with:
#>       n
#> 1 19657
#> 2    10
#> # A tibble: 6 x 11
#>   expression     n      min     mean   median      max `itr/sec` mem_alloc
#>   <chr>      <dbl> <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 logical    19657    1.5ms   1.81ms   1.71ms   3.39ms      553.     5.5MB
#> 2 which_1    19657   1.41ms   1.61ms   1.56ms   2.41ms      620.    2.89MB
#> 3 which_2    19657 826.56us 934.72us 910.88us   1.41ms     1070.    1.76MB
#> 4 logical       10 893.12us   1.06ms   1.02ms   1.93ms      941.    4.21MB
#> 5 which_1       10  814.4us 944.81us 908.16us   1.78ms     1058.    1.69MB
#> 6 which_2       10 230.72us 264.45us 249.28us   1.08ms     3781.  498.34KB
#> # ... with 3 more variables: n_gc <dbl>, n_itr <int>, total_time <bch:tm>

Created on 2018-08-19 by the reprex package (v0.2.0).

Wow, it's a useful example! Why is there a difference between `which_1 = df[which(df$price > mu), ][i, ]` and `which_2 = df[which(df$price > mu)[i], ]`? — Wooheon, Aug 19 '18 at 09:21
@0Hong That's because it's much slower to get a subset of a dataframe than a subset of a vector. In `which_1` you get 2 dataframe subsets, while in `which_2` you get 1 vector subset and 1 dataframe subset. — Mikko Marttila, Aug 19 '18 at 09:27
@0Hong Dataframe subsets are slow because you have to essentially do a vector subset for each column of the data frame (this is also why it's slower to subset a dataframe with many columns than one with fewer columns). If in a list subset you want to get a subset for each element in the list, it'll also be slow. But if you just get a subset of the list elements, then it'll be faster. — Mikko Marttila, Aug 19 '18 at 09:41
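To make the difference between the two `which` variants concrete, here is a small sketch on a hypothetical six-row data frame (the data and the names `two_steps` and `one_step` are assumptions for illustration). Both forms should return the same rows; only the number of data frame subsets differs.

```r
df <- data.frame(id = 1:6, price = c(10, 50, 20, 60, 70, 5))
mu <- mean(df$price)
i  <- 1:2                                   # we only want the first two matches

# Two data frame subsets: first all matching rows, then the first two of those.
two_steps <- df[which(df$price > mu), ][i, ]

# One vector subset (on the integer positions) followed by one data frame subset.
one_step <- df[which(df$price > mu)[i], ]

all.equal(two_steps, one_step)
```

This sketch assumes `i` does not run past the number of matches; indexing beyond that produces NA positions in either form.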

score 1 · Answer 3 · answered Aug 19 '18 at 10:35

`which` removes the NA elements. If we need the same behavior as `which` when there are NAs, use another condition along with `==`:

iris %>% 
  group_by(Species) %>% 
  summarise(max = Sepal.Width[Sepal.Length == max(Sepal.Length, na.rm = TRUE) & 
                                   !is.na(Sepal.Length)])
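As a rough check that the two approaches agree when NAs are present, here is a base R sketch (so it runs without dplyr). The copy `iris_na` with one injected NA and the helper names are assumptions for illustration.

```r
# Copy of the built-in iris data with one missing Sepal.Length (illustrative only).
iris_na <- iris
iris_na$Sepal.Length[1] <- NA

by_species <- split(iris_na, iris_na$Species)

# Variant 1: let which() drop the NA produced by the comparison.
with_which <- sapply(by_species, function(d)
  d$Sepal.Width[which(d$Sepal.Length == max(d$Sepal.Length, na.rm = TRUE))])

# Variant 2: keep the logical index but force NA comparisons to FALSE.
with_isna <- sapply(by_species, function(d)
  d$Sepal.Width[d$Sepal.Length == max(d$Sepal.Length, na.rm = TRUE) &
                  !is.na(d$Sepal.Length)])

identical(with_which, with_isna)
```

Without either safeguard, the setosa group would return an extra NA alongside its value.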
