Getting rows of a matrix which coincide with a series of vectors, without using apply

Question

My question is sort of related to my earlier question.

Suppose I have one matrix and 4 vectors (can consider this another matrix, since the order of the vectors matters), and I want to get the row numbers which coincide to each vector, in order. I would like the solution to avoid repeating vectors and be as efficient as possible, since the problem is large scale.

Example.

 set.seed(1)

    M = matrix(rpois(50,5),5,10)
    v1 = c(3, 2, 7, 7, 4, 4, 7,  4, 5, 6)
    v2=  c(8, 6,  4, 4, 3,  8,  3, 6, 5, 6)
    v3=  c(4,  8, 3,  5, 9, 4, 5,  6, 7 ,7)
    v4=  c(4,  9, 3, 6,  3, 1, 5, 7,6, 1)

Vmat = cbind(v1,v2,v3,v4)

M
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    4    8    3    5    9    4    5    6    7     7
[2,]    4    9    3    6    3    1    5    7    6     1
[3,]    5    6    6   11    6    4    5    2    7     5
[4,]    8    6    4    4    3    8    3    6    5     6
[5,]    3    2    7    7    4    4    7    4    5     6

Vmat
      v1 v2 v3 v4
 [1,]  3  8  4  4
 [2,]  2  6  8  9
 [3,]  7  4  3  3
 [4,]  7  4  5  6
 [5,]  4  3  9  3
 [6,]  4  8  4  1
 [7,]  7  3  5  5
 [8,]  4  6  6  7
 [9,]  5  5  7  6
[10,]  6  6  7  1

The output should be...

5 4 1 2

I think `apply` is the right way to go, even though you don't like it. `m = apply(M,1,paste0,collapse="_"); v = apply(Vmat,2,paste0,collapse="_"); match(v,m)` — Frank, Sep 18 '15 at 15:07
Frank, avoiding rep(v1,) to do some comparison. If you see the other question, some solutions repeat the vector, but the last one does not, which is much more efficient. Apply will not work, it's too slow to loop like that. — wolfsatthedoor, Sep 18 '15 at 15:09
@robert, here is an entire vectorized solution below, taking into account the fact that we can permute columns by repeating `vmat` to compare them to columns of `t(M)` — Colonel Beauvel, Sep 18 '15 at 16:22

score 5 · Answer 1 · answered Sep 18 '15 at 15:16

5

I think collapsing each vector to a single value is the way to go, following @bunk:

m = do.call(function(...) paste(...,sep="_"), split(M, col(M)))
v = sapply(list(v1,v2,v3,v4), paste0, collapse="_")
match(v,m)
# [1] 5 4 1 2

The more natural way of building m would use apply, but that's verboten. If you store M as a data.frame, another option is:

m = do.call(function(...) paste(...,sep="_"), as.data.frame(M))

answered Sep 18 '15 at 15:16

Frank

66,179
8
96
180

2

This is so much faster than the merge solution above. I am going to post an answer comparing the two answers. Such a good answer, thank you. – wolfsatthedoor Sep 18 '15 at 16:21

Frank · Accepted Answer · 2015-09-22T03:20:20.767

Similar to @user295691's answer, we merge, but now with which=TRUE option in merge.data.table:

set.seed(1)
matdata  <- create_data(1e6,20,1e5) # using @user295691's example data

library(data.table)
M = as.data.table(matdata$M)
V = as.data.table(matdata$V)

r <- M[V, on=names(V), which=TRUE]

To verify that it is correct...

V[1,]
#    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
# 1:  7  5  3  2  5  6  3  3  5   5   3   2   4   9   4   4   3   6   4   3
M[r[1],]
#    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
# 1:  7  5  3  2  5  6  3  3  5   5   3   2   4   9   4   4   3   6   4   3

Benchmarks

OP's example data (in a deleted answer):

set.seed(1)

NM    = 1e6
NV    = 1e5
Ncols = 20
MM = matrix(rpois(NM*Ncols,Ncols),NM,Ncols)

rows=sample(NM,NV,replace = FALSE)

Vmat=t(MM[rows,])

# converted to data.frames, because why not?
M = as.data.frame(MM)
V = as.data.frame(t(Vmat))

# converted to data.tables
M2 = setDT(copy(M))
V2 = setDT(copy(V))

Functions to test:

match_strings <- function(){
  m = do.call(function(...) paste(...,sep="_"), M)
  v = do.call(function(...) paste(...,sep="_"), V)
  match(v,m)
}

merge_df <- function(){ # from @user295691's answer
  M$mid = seq(nrow(M))
  V$vid = seq(nrow(V))
  with(merge(M,V), mid[order(vid)])
}

merge_dt <- function(){
  M2[V2, on=names(V2), which=TRUE]
}

Results:

system.time({r_strings = match_strings()})
#    user  system elapsed 
#   10.40    0.06   10.49     
system.time({r_merge_df = merge_df()})
#    user  system elapsed 
#   14.71    0.10   14.84
system.time({r_merge_dt = merge_dt()})
#    user  system elapsed 
#    0.39    0.00    0.40 

identical(r_strings,r_merge_df) # TRUE
identical(r_strings,r_merge_dt) # TRUE

@robertevansanders But so rewarding! In this case with 1.9.4, you can follow the `merge_df` style with `system.time({M2[, id := .I]; setkeyv(M2,names(V2)); M2[V2]$id -> r}) # .38 seconds for me` and `identical(r, r_strings) # TRUE` — Frank, Sep 18 '15 at 17:19
Can you explain what this code does? I am having a hard time getting it to work. M2[, id := .I]; setkeyv(M2,names(V2)); M2[V2]$id -> r — wolfsatthedoor, Sep 18 '15 at 17:49
`M2[, id := .I]` creates a row id column in M2 (same as `M$mid = seq(nrow(M))` inside `merge_df`). `setkeyv(M2,names(V2))` prepares `M2` for merging with `V2` by sorting on columns `V1:V20`, which is necessary in data.table 1.9.4. `M2[V2]` does the merge -- each row of the result corresponds to a row of `V2`, so we can just take `$id` and it will be sorted appropriately (unlike in `merge_df`) — Frank, Sep 18 '15 at 18:01
@robertevansanders sorry about that. 1.9.6 is now on CRAN. Perhaps Jan's suggestion about 'drat' is the way to go... — Arun, Sep 22 '15 at 22:20

user295691 · Answer 3 · 2015-09-18T15:45:22.563

If we switch these to data.frames, then we can use merge to do the trick. Also, we rotate Vmat for easy matching.

haystack <- as.data.frame(M)
haystack$haystack_id <- rownames(haystack)
needle <- as.data.frame(t(Vmat))
needle$needle_id <- rownames(needle)

lookups <- merge(needle, haystack)
lookups <- lookups[order(lookups$needle_id), ]

If we compare this to the string/match solution above, it appears to be faster by a reasonable degree

create_data <- function(haystack.rows, cols, needle.rows) {
   M <- matrix(rpois(haystack.rows * cols, 5), haystack.rows, cols)
   V <- M[sample(1:haystack.rows, needle.rows, replace=T),]
   list(M=M, V=V)
}

> set.seed(1); data <- create_data(1000000, 20, 10000);
> system.time({haystack <- as.data.frame(data$M); haystack$hid <- seq_along(haystack$V1); needle <- as.data.frame(data$V); needle$nid <- seq_along(needle$V1); ret <- merge(needle, haystack); ret <- ret[order(ret$nid),]})
   user  system elapsed
  5.900   0.000   5.906

> system.time({mstr <- apply(data$M, 1, paste0, collapse="_"); vstr <- apply(data$V, 1, paste0, collapse="_"); matchstr <- match(vstr, mstr)})
   user  system elapsed
  8.372   0.000   8.377

match on strings is much faster than merge but you have to pay the cost of transforming the data, whereas converting to a data frame is very cheap, since it uses the same underlying data.

EDIT: added a sort step to the merge version to get the rows in order. Also fixed a typo in the timed version of the merge version. Times remained in the same order of magnitude

EDIT2: Thanks to @Frank, found a bug in the match version of the time, which sped up things substantially (I had been using a local example called asdf which was even larger). Still not as fast as the merge solution, though.

No need to write your edit notes in the body of the answer. They can be included in the edit summary (a small text field visible when editing). Best to just make your answer into its best version, without documenting its history. — Frank, Sep 18 '15 at 15:47
@Frank: obviously a matter of opinion; I like having the edit notes because while the question is being discussed, I don't like it when things have changed that make my analysis of the answer invalid without warning. I could probably make a good case for removing the `EDIT` comments once an answer is accepted. — user295691, Sep 18 '15 at 15:50
Fair enough. Might want to add a `identical(ret$hid, matchstr)` to confirm that we're doing the same (presumably right) thing. Pretty standard for benchmarking. — Frank, Sep 18 '15 at 15:55
This speed is compared with Frank's answer , right? Is there theoretically anything faster than using match? I don't know much about matching algorithms' computational complexity. — wolfsatthedoor, Sep 18 '15 at 16:01
How is the OP getting reversed results, I wonder. Is their example data substantially different? — Frank, Sep 18 '15 at 16:30

Getting rows of a matrix which coincide with a series of vectors, without using apply

3 Answers3

Linked