
Say we have a data frame containing strings in several columns. We want to find the indices of all rows that contain a certain value, or, better yet, one of several values. The column, however, is unknown.

What I do, at the moment, is:

apply(df, 2, function(x) which(x == "M017"))

where df =

              datetime col1 col2 col3 col4
1  04.10.2009 01:24:51 M017 <NA> <NA>   NA
2  04.10.2009 01:24:53 M018 <NA> <NA>   NA
3  04.10.2009 01:24:54 M051 <NA> <NA>   NA
4  04.10.2009 01:25:06 <NA> M016 <NA>   NA
5  04.10.2009 01:25:07 <NA> M015 <NA>   NA
6  04.10.2009 01:26:07 <NA> M017 <NA>   NA
7  04.10.2009 01:26:27 <NA> M017 <NA>   NA
8  04.10.2009 01:27:23 <NA> M017 <NA>   NA
9  04.10.2009 01:27:30 <NA> M017 <NA>   NA
10 04.10.2009 01:27:32 M017 <NA> <NA>   NA
11 04.10.2009 01:27:34 M051 <NA> <NA>   NA

This also works if we try to find more than one value:

apply(df, 2, function(x) which(x %in% c("M017", "M018")))

The result being:

$`1`
integer(0)

$`2`
[1]  1  2 20

$`3`
[1] 16 17 18 19

$`4`
integer(0)

$`5`
integer(0)

However, processing the resulting list of index vectors is rather tedious.
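
For reference, collapsing that list into a single vector of row indices takes an extra step, something like this minimal sketch (`hits` is just an illustrative name):

hits <- apply(df, 2, function(x) which(x %in% c("M017", "M018")))
sort(unique(unlist(hits)))  # every matching row index, listed once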

Is there a more efficient way to find rows that contain a value (or more) in ANY column?

sebastianmm
  • You could try `which(df == "M017", arr.ind=TRUE)`. It will give the row/column indices, from which you can extract the rows. – akrun Jan 30 '15 at 10:17
  • For multiple values, I think `apply`-family based (sapply, lapply) solutions might be better. One option would be to vectorise, i.e. if `v1 <- c('M017', 'M018'); which(Vectorize(function(x) x %in% v1)(df), arr.ind=TRUE)` – akrun Jan 30 '15 at 10:26
  • Thanks, akrun, I deleted my comment because you answered my question before I posted it. – sebastianmm Jan 30 '15 at 10:35
  • 3
    Do you have many values which you want to check? If not you could easily vectorize the whole thing to something like `rowSums(df == "M017" | df == "M018", na.rm = TRUE) > 0L` and avoid `apply` loops all together. – David Arenburg Jan 30 '15 at 11:51
  • @DavidArenburg That would be my choice if there were only two values. – akrun Jan 30 '15 at 12:03
  • In my case, it's one or two. Now I'm wondering, though, what the magic number is here. Is it less efficient with 3 values? Or 30? – sebastianmm Jan 30 '15 at 12:29
  • @sebowski I think one problem is in typing the whole thing `df=='M017'|df=='M018'|....`. The `Reduce` function in my post does a similar thing for multiple values (it's sort of general). It is slower because I included 200 values... In the two-value case, you should try what DavidArenburg suggested. – akrun Jan 30 '15 at 12:31
  • The biggest problem IMO is the overhead of calling `df` too many times, while the `apply` loop does it only once. Thus it could be interesting to test when the `apply` loop becomes more efficient. – David Arenburg Jan 30 '15 at 12:49
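
A minimal sketch of the `arr.ind=TRUE` idea from the comments above (`pos` is an illustrative name, not from the thread):

pos <- which(df == "M017", arr.ind=TRUE)  # row/column position of every match
df[unique(pos[, "row"]), ]                # extract the matching rows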

4 Answers


How about

apply(df, 1, function(r) any(r %in% c("M017", "M018")))

The ith element will be TRUE if the ith row contains one of the values, and FALSE otherwise. Or, if you want just the row numbers, wrap the above statement in `which(...)`.
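
For instance, a usage sketch (`idx` is just an illustrative name):

idx <- which(apply(df, 1, function(r) any(r %in% c("M017", "M018"))))
df[idx, ]  # subset df down to the matching rows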

konvas
  • 5
    It should be noted that for very few values to find, @DavidArenburg 's suggestion of `rowSums(df == "M017" | df == "M018", na.rm = TRUE) > 0L` would be more efficient. – sebastianmm Jan 30 '15 at 13:14
  • Is it possible to get the rows as a list output, instead of the raw value of `apply`? – trexgris Nov 06 '19 at 21:23
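
A sketch of the `rowSums` variant mentioned in the comments, which avoids `apply` when the set of values is small and fixed (`idx` is an illustrative name):

idx <- which(rowSums(df == "M017" | df == "M018", na.rm=TRUE) > 0L)
df[idx, ]  # rows containing M017 or M018 in any column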

Here's a dplyr option:

library(dplyr)

# across all columns:
df %>% filter_all(any_vars(. %in% c('M017', 'M018')))

# or in only select columns:
df %>% filter_at(vars(col1, col2), any_vars(. %in% c('M017', 'M018')))                                                                                                     
sbha
  • Do you know how to return a boolean that satisfies the conditions of the `filter_at()` statement that you provided? – Tyler R. Aug 14 '19 at 21:51
  • Answer [here](https://stackoverflow.com/questions/57512813/dplyr-create-new-boolean-variable-from-any-vars-filter-condition/57512829) for anyone interested. – Tyler R. Aug 15 '19 at 17:31

If you want to find the rows that have any of the values in a vector, one option is to loop over the vector (`lapply(v1, ..)`) and create a logical index (TRUE/FALSE) with `==`. Then use `Reduce` with OR (`|`) to collapse the list to a single logical matrix by ORing corresponding elements. Finally, sum each row (`rowSums`) and double-negate (`!!`) to get a logical vector marking the rows with any match.

indx1 <- !!rowSums(Reduce(`|`, lapply(v1, `==`, df)), na.rm=TRUE)
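
Unpacked into steps, that one-liner does the following (a sketch; the intermediate names are illustrative):

cmp  <- lapply(v1, `==`, df)         # one logical matrix per value in v1
any1 <- Reduce(`|`, cmp)             # element-wise OR across the list
indx <- !!rowSums(any1, na.rm=TRUE)  # TRUE for rows with at least one match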

Or vectorise, and get the row indices with `which(..., arr.ind=TRUE)`:

indx2 <- unique(which(Vectorize(function(x) x %in% v1)(df),
                      arr.ind=TRUE)[,1])
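
Either index can then be used to subset the data, e.g. (a sketch):

df[indx2, ]  # rows with at least one match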

Benchmarks

I didn't use @kristang's solution as it was giving me errors. Based on a 1000 x 500 matrix, @konvas's solution is the most efficient (so far), but this may vary if the number of rows is increased.

val <- paste0('M0', 1:1000)
set.seed(24)
df1 <- as.data.frame(matrix(sample(c(val, NA), 1000*500, 
  replace=TRUE), ncol=500), stringsAsFactors=FALSE) 
set.seed(356)
v1 <- sample(val, 200, replace=FALSE)

 konvas <- function() {apply(df1, 1, function(r) any(r %in% v1))}
 akrun1 <- function() {!!rowSums(Reduce(`|`, lapply(v1, `==`, df1)),
                                 na.rm=TRUE)}
 akrun2 <- function() {unique(which(Vectorize(function(x) x %in% v1)(df1),
                                    arr.ind=TRUE)[,1])}


 library(microbenchmark)
 microbenchmark(konvas(), akrun1(), akrun2(), unit='relative', times=20L)
 #Unit: relative
 #     expr       min         lq       mean     median         uq      max neval cld
 # konvas()   1.00000   1.000000   1.000000   1.000000   1.000000  1.00000    20   a
 # akrun1() 160.08749 147.642721 125.085200 134.491722 151.454441 52.22737    20   b
 # akrun2()   5.85611   5.641451   4.676836   5.330067   5.269937  2.22255    20   a

For ncol = 10, the results are slightly different:

      expr       min        lq     mean    median        uq       max neval
  konvas()  3.116722  3.081584  2.90660  2.983618  2.998343  2.394908    20
  akrun1() 27.587827 26.554422 22.91664 23.628950 21.892466 18.305376    20
  akrun2()  1.000000  1.000000  1.00000  1.000000  1.000000  1.000000    20

data

 v1 <- c('M017', 'M018')
 df <- structure(list(datetime = c("04.10.2009 01:24:51",
"04.10.2009 01:24:53", 
"04.10.2009 01:24:54", "04.10.2009 01:25:06", "04.10.2009 01:25:07", 
"04.10.2009 01:26:07", "04.10.2009 01:26:27", "04.10.2009 01:27:23", 
"04.10.2009 01:27:30", "04.10.2009 01:27:32", "04.10.2009 01:27:34"
), col1 = c("M017", "M018", "M051", "<NA>", "<NA>", "<NA>", "<NA>", 
"<NA>", "<NA>", "M017", "M051"), col2 = c("<NA>", "<NA>", "<NA>", 
"M016", "M015", "M017", "M017", "M017", "M017", "<NA>", "<NA>"
), col3 = c("<NA>", "<NA>", "<NA>", "<NA>", "<NA>", "<NA>", "<NA>", 
"<NA>", "<NA>", "<NA>", "<NA>"), col4 = c(NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA)), .Names = c("datetime", "col1", "col2", 
"col3", "col4"), class = "data.frame", row.names = c("1", "2", 
"3", "4", "5", "6", "7", "8", "9", "10", "11"))
sebastianmm
akrun
  • Thanks. I added benchmark results for a table more similar to mine (ncol = 10). The results are somewhat different. – sebastianmm Jan 30 '15 at 12:56
  • @sebowski Yes, as I mentioned earlier in the post, the benchmarks will differ based on the length of `v1`, nrow, ncol, etc. When you have only two values in `v1` and a dataset of 1e7 rows with fewer columns, `apply` may be less efficient. – akrun Jan 30 '15 at 12:58
  • I also ran the test with 4 and 10 Strings to find (`v1`), and the results are virtually identical for `ncol = 10`. – sebastianmm Jan 30 '15 at 13:10
  • @akrun, Thank you for your post! I had a similar requirement, where I have to create a new column in the data frame holding the value that is matched from the vector. I am using indx2, as I was able to understand that code better than indx1 (still learning R). I wanted to check how I can now use the returned row and column values to pick up the respective value when creating the new column. Not all rows would satisfy the condition; for them it should be blank or 0. Could you please suggest? Thank you!! – user1412 Jul 28 '17 at 12:21

Since the `_at`/`_all` functions are superseded in dplyr now, we may use `if_any` to select rows which have any one of the values.

library(dplyr)

df %>% filter(if_any(starts_with("col"), ~.x %in% c("M017", "M018")))

#              datetime col1 col2 col3 col4
#1  04.10.2009 01:24:51 M017 <NA> <NA>   NA
#2  04.10.2009 01:24:53 M018 <NA> <NA>   NA
#6  04.10.2009 01:26:07 <NA> M017 <NA>   NA
#7  04.10.2009 01:26:27 <NA> M017 <NA>   NA
#8  04.10.2009 01:27:23 <NA> M017 <NA>   NA
#9  04.10.2009 01:27:30 <NA> M017 <NA>   NA
#10 04.10.2009 01:27:32 M017 <NA> <NA>   NA
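
Relatedly, a sketch not taken from the answer: if you want the logical flag itself rather than the filtered rows, `if_any` also works inside `mutate` (`has_hit` is a hypothetical column name):

df %>% mutate(has_hit = if_any(starts_with("col"), ~.x %in% c("M017", "M018")))
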
Ronak Shah