
Using the following code:

  c <- NULL
  for (a in 1:4){
    b <- seq(from = a, to = a + 5)
    c <- rbind(c, b)
  }
  c <- rbind(c, c); rm(a, b)

I get this matrix:

> c
  [,1] [,2] [,3] [,4] [,5] [,6]
b    1    2    3    4    5    6
b    2    3    4    5    6    7
b    3    4    5    6    7    8
b    4    5    6    7    8    9
b    1    2    3    4    5    6
b    2    3    4    5    6    7
b    3    4    5    6    7    8
b    4    5    6    7    8    9

How can I return row indices for rows matching a specific input?

For example, with a search term of,

z <- c(3,4,5,6,7,8)

I need the following returned,

[1] 3 7

This will be used on a fairly large data frame of test data with a time-step column; the goal is to reduce the data by accumulating the time steps of matching rows.


The question was answered well by others. Due to my dataset size (9.5M rows), I came up with an efficient approach that takes a couple of steps.

1) Sort the big data frame `dc` (the time steps to accumulate are in column 1) by the remaining columns.

dc <- dc[order(dc[,2],dc[,3],dc[,4],dc[,5],dc[,6],dc[,7],dc[,8]),]

2) Create a new data frame with unique entries (excluding column 1).

dcU <- unique(dc[,2:8])

3) Write an Rcpp (C++) function that loops over the unique data frame, accumulating time from the original data frame while the rows are equal, and advancing to the next unique row when an unequal row is found.

  require(Rcpp)
  getTsrc <-
    '
  NumericVector getT(NumericMatrix dc, NumericMatrix dcU)
  {
  int k = 0;            // cursor into the sorted full data set
  int n = dcU.nrow();
  int m = dc.nrow();
  NumericVector tU(n);  // accumulated time per unique row
  for (int i = 0; i < n; i++)
    {
    // accumulate time (column 0 of dc) while the current dc row matches
    // unique row i; the k < m guard avoids reading past the end of dc
    while (k < m &&
           (dcU(i,0)==dc(k,1)) && (dcU(i,1)==dc(k,2)) && (dcU(i,2)==dc(k,3)) &&
           (dcU(i,3)==dc(k,4)) && (dcU(i,4)==dc(k,5)) && (dcU(i,5)==dc(k,6)) &&
           (dcU(i,6)==dc(k,7)))
      {
      tU[i] += dc(k,0);
      k++;
      }
    }
  return tU;
  }
    '
  cppFunction(getTsrc)

4) Convert the function inputs to matrices.

  dc1 <- as.matrix(dc)
  dcU1 <- as.matrix(dcU)

5) Run the function and time it (it returns a time vector matching the unique data frame).

  pt <- proc.time()
  t <- getT(dc1, dcU1)   # note: `t` masks base::t() here
  print(proc.time() - pt)

   user  system elapsed 
   0.18    0.03    0.20 
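
The accumulated times in `t` line up row-for-row with `dcU`, so the reduced data set is just the two bound together. A minimal sketch of that last step (the name `dcReduced` is mine, assuming column 1 of `dc` held the time step):

  # bind each accumulated time onto its unique row
  dcReduced <- cbind(time = t, dcU)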

6) Self high-five and more coffee.

Scott Smith

3 Answers


You can use `apply`.

Here we `apply` over the rows of `c` (the `1` is the row margin), testing each row with `function(x) all(x == z)`.

The `which` then pulls out the integer positions of the matching rows.

which(apply(c, 1, function(x) all(x == z)))
b b 
3 7

EDIT: If this approach struggles on your real data, and it only has 9 columns (not too much typing), you could try a fully vectorized solution:

which(c[,1]==z[1] & c[,2]==z[2] & c[,3]==z[3] & c[,4]==z[4] & c[,5]==z[5] & c[,6]==z[6])
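
If the column count varies, a sketch of the same idea generalized with `Reduce` (my addition, not from the original answer):

# AND together the columnwise comparisons, however many columns there are
hits <- Reduce(`&`, lapply(seq_along(z), function(j) c[, j] == z[j]))
which(hits)
# b b 
# 3 7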
jeremycg
  • Works perfectly for the example! My source data frame `dc` has 9.3M rows and 9 numeric columns. `unique(dc)` returns 3.5M rows. Attempting to use this approach to accumulate time for the first row appears to choke the system and has not completed in some time. I will have to experiment to ensure I apply it correctly. I've used Rcpp for big for loops previously; that might be the approach here in the absence of an efficient function. -- Thanks! – Scott Smith Dec 08 '15 at 16:32
  • @ScottSmith Did you see my answer, and jeremycg's comment on it, below? I was actually wondering if this would happen to you! – rbatt Dec 08 '15 at 17:19

The answer by @jeremycg will definitely work, and is fast if you have many columns and few rows. However, if you have lots of rows you might be able to go a bit faster by avoiding apply() on the row dimension.

Here's an alternative:

l <- unlist(apply(c, 2, list), recursive=F)
logic <- mapply(function(x,y)x==y, l, z)
which(.rowSums(logic, m=nrow(logic), n=ncol(logic)) == ncol(logic))

[1] 3 7

It works by first turning each column into a list. Then, it takes each column-list and searches it for the corresponding element in z. In the last step, you find out which rows had all columns with the corresponding match in z. Even though the last step is a row-wise operation, by using .rowSums (mind the . at the front there) we can specify the dimensions of the matrix, and get a speed-up.
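
To make that concrete, for the example data `logic` is an 8-by-6 logical matrix, one row per row of c and one column per element of z (illustrative checks):

logic[3, ]  # all TRUE  -- row 3 matches z in every column
logic[1, ]  # all FALSE -- row 1 matches z in no column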

Let's test the timings of the two approaches.

The functions

f1 <- function(){
    which(apply(c, 1, function(x) all(x == z)))
}

f2 <- function(){
    l <- unlist(apply(c, 2, list), recursive=F)
    logic <- mapply(function(x,y)x==y, l, z)
    which(.rowSums(logic, m=nrow(logic), n=ncol(logic)) == ncol(logic))
}

With 8 rows (dim in example):

> library(microbenchmark)
> time <- microbenchmark(f1(), f2())
> time
Unit: microseconds
 expr    min      lq     mean  median     uq     max neval cld
 f1() 21.147 21.8375 22.86096 22.6845 23.326  30.443   100  a 
 f2() 42.310 43.1510 45.13735 43.7500 44.438 137.413   100   b

With 80 rows:

Unit: microseconds
 expr     min      lq     mean   median       uq     max neval cld
 f1() 101.046 103.859 108.7896 105.1695 108.3320 166.745   100   a
 f2()  93.631  96.204 104.6711  98.1245 104.7205 236.980   100   a

With 800 rows:

> time <- microbenchmark(f1(), f2())
> time
Unit: microseconds
 expr     min       lq      mean    median        uq       max neval cld
 f1() 920.146 1011.394 1372.3512 1042.1230 1066.7610 31290.593   100   b
 f2() 572.222  579.626  593.9211  584.5815  593.6455  1104.316   100  a 

Note that my timing assessment only had 100 replicates each, and although these results are representative, there's a bit of variability in the number of rows required before the two methods are equal.

Regardless, I think my approach would probably be faster once you have 100+ rows.

Also, note that you can't simply transpose c to make f1() faster. First, the t() itself takes time; second, because you're comparing to z, you would then just have to make the comparison column-wise (after the transpose), so nothing is gained.

Finally, I'm sure there's an even faster way to do this. My answer was just the first thing that came to mind, and didn't require installing any packages. This could be a lot faster if you wanted to use data.table. Also, if you had a lot of columns, you might even be able to parallelize this procedure (although, to be worthwhile, the dataset would have to be immense).
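
For instance, a minimal data.table sketch using an all-columns join with `which = TRUE` (my assumption of how it would be set up; `dt` and `zt` are names I'm introducing):

library(data.table)
dt <- as.data.table(c)                 # matrix columns become V1..V6
zt <- as.data.table(as.list(z))        # one-row table with matching names
dt[zt, on = names(dt), which = TRUE]   # row indices of matches: 3 7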

If these timings aren't tolerable for your data, you might consider reporting back with the dimensions of your data set.

rbatt
  • well how about the fully vectorised `which((c[,1]==z[1] & c[,2]==z[2] & c[,3]==z[3] & c[,4]==z[4]& c[,5]==z[5]& c[,6]==z[6]))`? – jeremycg Dec 08 '15 at 17:02
  • Fair point; that's essentially what I'm doing though, I think. Except my example works on an arbitrary number of columns; that's what the `mapply` would do. From there, instead of combining with `&` (which I agree is better), I use the sum of 1's and 0's (from TRUE and FALSE). But yeah, that's the idea! – rbatt Dec 08 '15 at 17:18
  • Really appreciate the feedback! I hope to finish this up this afternoon and will report back. My original dataset has 9 columns (analogous example: height, width) and 9.5M rows. Each row persists for a certain amount of time (time step). There are 3.5M unique rows which I want to bin time steps into to reduce the data size without affecting integrity. – Scott Smith Dec 08 '15 at 18:24
  • On the same concept, you could avoid checking all rows every time, by iteratively excluding the rows that fail a "==". E.g. if `c[i, 1] != z[1]`, then there is no need to compare c[i, 2] with z[2]. As a function, something like `function(x, pat) { wh = seq_len(nrow(x)); for(i in seq_along(pat)) wh = wh[x[wh, i] == pat[[i]]]; return(wh) }` where "x" and "pat" are the OP's "c" and "z" respectively. – alexis_laz Dec 08 '15 at 19:33
  • @alexis_laz I think you're right for a very general case. But since we're working in R, "vectorizing" an operation is really fast (the loop is in C). If you move the loop into R and stop once you reach a column that is FALSE, you might have to do fewer iterations than the C loop, but doing the full thing as a C loop would still probably be faster. Unless, of course, there are many columns. That said, I should probably have done `Reduce("&&", c(T, F, T, T))`, which I think is essentially your suggestion. – rbatt Dec 08 '15 at 19:38
  • @alexis_laz Actually, I don't think I should do the `Reduce` either, because that would still need to be applied row-wise, and is probably much slower than the `.rowSums`. – rbatt Dec 08 '15 at 19:41
  • @rbatt : `mapply` and `for` loop are very similar in speed and we both apply the loops on a column basis. So, I assume that excluding rows to use for "==" will save some time, unless all rows match the pattern until the very last column. `Reduce` won't accumulate the excluded rows (unless, perhaps, passing a complex function as argument). A simple for loop that keeps track of which rows to choose is pretty straightforward. With a rough benchmark using `x = matrix(sample(203), 1e6, 10); pat = x[1234, ]`, the for loop above seems faster. – alexis_laz Dec 08 '15 at 19:46
  • @alexis_laz I misunderstood you, sorry. I wonder if your answer is on par (speed wise) with mine in the worst-case scenario for your answer (i.e., where all the columns, or all but the last column, are TRUE, for all rows). My hunch is that with real data your approach will typically be faster (only noticeable with many rows; but that's relevant for this question), but that it won't be faster in the aforementioned situations (the worst-case scenarios I pointed out). Are you going to write that up as an answer? – rbatt Dec 08 '15 at 20:01
  • @alexis_laz my understanding (and I could be wrong) is that when you chain logicals with &, it terminates when it hit the first FALSE. So, the vectorized version should already be doing that. – jeremycg Dec 08 '15 at 22:23
  • @rbatt : I tend to believe the same as you; in the worst-case (where all rows are matched), mine might be slower because it constantly subsets/updates/checks which rows to keep comparing, so a vector of `length == nrow(data)` will be read again and again while, still, checking all rows against each pattern's element. @jeremycg : The approach in your comment, still, checks all rows in every column. As for the `&`, unless I misunderstand, I don't think that it stops since each time, the binary function, "&" is called with 2 vectors of `length == nrow(data)`. – alexis_laz Dec 09 '15 at 13:29
  • Contrary to the "green" answer, this one doesn't work when `c` is a 1-row matrix (imagine a reference "dictionary" matrix that grows over the course of the algorithm). An exemplary improvement: put `if (is.vector(logic)) logic = matrix(logic, nrow=1)` before `which(.rowSums(logic, m=nrow(logic), n=ncol(logic)) == ncol(logic))`. – hanna Sep 08 '16 at 11:51

In your code, `c` is not a data frame. Try transforming it into one:

c <- data.frame(c)
Breathe