3

My question is sort of related to my earlier question.

Suppose I have one matrix and 4 vectors (can consider this another matrix, since the order of the vectors matters), and I want to get the row numbers which coincide to each vector, in order. I would like the solution to avoid repeating vectors and be as efficient as possible, since the problem is large scale.

Example.

 set.seed(1)

    M = matrix(rpois(50,5),5,10)
    v1 = c(3, 2, 7, 7, 4, 4, 7,  4, 5, 6)
    v2=  c(8, 6,  4, 4, 3,  8,  3, 6, 5, 6)
    v3=  c(4,  8, 3,  5, 9, 4, 5,  6, 7 ,7)
    v4=  c(4,  9, 3, 6,  3, 1, 5, 7,6, 1)

Vmat = cbind(v1,v2,v3,v4)

M
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    4    8    3    5    9    4    5    6    7     7
[2,]    4    9    3    6    3    1    5    7    6     1
[3,]    5    6    6   11    6    4    5    2    7     5
[4,]    8    6    4    4    3    8    3    6    5     6
[5,]    3    2    7    7    4    4    7    4    5     6

Vmat
      v1 v2 v3 v4
 [1,]  3  8  4  4
 [2,]  2  6  8  9
 [3,]  7  4  3  3
 [4,]  7  4  5  6
 [5,]  4  3  9  3
 [6,]  4  8  4  1
 [7,]  7  3  5  5
 [8,]  4  6  6  7
 [9,]  5  5  7  6
[10,]  6  6  7  1

The output should be...

5 4 1 2
Community
  • 1
  • 1
wolfsatthedoor
  • 7,163
  • 18
  • 46
  • 90

3 Answers3

5

I think collapsing each vector to a single value is the way to go, following @bunk:

m = do.call(function(...) paste(...,sep="_"), split(M, col(M)))
v = sapply(list(v1,v2,v3,v4), paste0, collapse="_")
match(v,m)
# [1] 5 4 1 2

The more natural way of building m would use apply, but that's verboten. If you store M as a data.frame, another option is:

m = do.call(function(...) paste(...,sep="_"), as.data.frame(M))
Frank
  • 66,179
  • 8
  • 96
  • 180
  • 2
    This is so much faster than the merge solution above. I am going to post an answer comparing the two answers. Such a good answer, thank you. – wolfsatthedoor Sep 18 '15 at 16:21
3

Similar to @user295691's answer, we merge, but now with which=TRUE option in merge.data.table:

set.seed(1)
matdata  <- create_data(1e6,20,1e5) # using @user295691's example data

library(data.table)
M = as.data.table(matdata$M)
V = as.data.table(matdata$V)

r <- M[V, on=names(V), which=TRUE]

To verify that it is correct...

V[1,]
#    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
# 1:  7  5  3  2  5  6  3  3  5   5   3   2   4   9   4   4   3   6   4   3
M[r[1],]
#    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
# 1:  7  5  3  2  5  6  3  3  5   5   3   2   4   9   4   4   3   6   4   3

Benchmarks

OP's example data (in a deleted answer):

set.seed(1)

NM    = 1e6
NV    = 1e5
Ncols = 20
MM = matrix(rpois(NM*Ncols,Ncols),NM,Ncols)

rows=sample(NM,NV,replace = FALSE)

Vmat=t(MM[rows,])

# converted to data.frames, because why not?
M = as.data.frame(MM)
V = as.data.frame(t(Vmat))

# converted to data.tables
M2 = setDT(copy(M))
V2 = setDT(copy(V))

Functions to test:

match_strings <- function(){
  m = do.call(function(...) paste(...,sep="_"), M)
  v = do.call(function(...) paste(...,sep="_"), V)
  match(v,m)
}

merge_df <- function(){ # from @user295691's answer
  M$mid = seq(nrow(M))
  V$vid = seq(nrow(V))
  with(merge(M,V), mid[order(vid)])
}

merge_dt <- function(){
  M2[V2, on=names(V2), which=TRUE]
}

Results:

system.time({r_strings = match_strings()})
#    user  system elapsed 
#   10.40    0.06   10.49     
system.time({r_merge_df = merge_df()})
#    user  system elapsed 
#   14.71    0.10   14.84
system.time({r_merge_dt = merge_dt()})
#    user  system elapsed 
#    0.39    0.00    0.40 

identical(r_strings,r_merge_df) # TRUE
identical(r_strings,r_merge_dt) # TRUE
Frank
  • 66,179
  • 8
  • 96
  • 180
  • 2
    So frustrating trying to get 1.9.5 :( – wolfsatthedoor Sep 18 '15 at 17:13
  • 1
    @robertevansanders But so rewarding! In this case with 1.9.4, you can follow the `merge_df` style with `system.time({M2[, id := .I]; setkeyv(M2,names(V2)); M2[V2]$id -> r}) # .38 seconds for me` and `identical(r, r_strings) # TRUE` – Frank Sep 18 '15 at 17:19
  • I have windows which makes life hell haha :(. – wolfsatthedoor Sep 18 '15 at 17:32
  • Can you explain what this code does? I am having a hard time getting it to work. M2[, id := .I]; setkeyv(M2,names(V2)); M2[V2]$id -> r – wolfsatthedoor Sep 18 '15 at 17:49
  • 2
    `M2[, id := .I]` creates a row id column in M2 (same as `M$mid = seq(nrow(M))` inside `merge_df`). `setkeyv(M2,names(V2))` prepares `M2` for merging with `V2` by sorting on columns `V1:V20`, which is necessary in data.table 1.9.4. `M2[V2]` does the merge -- each row of the result corresponds to a row of `V2`, so we can just take `$id` and it will be sorted appropriately (unlike in `merge_df`) – Frank Sep 18 '15 at 18:01
  • 1
    @robertevansanders sorry about that. 1.9.6 is now on CRAN. Perhaps Jan's suggestion about 'drat' is the way to go... – Arun Sep 22 '15 at 22:20
2

If we switch these to data.frames, then we can use merge to do the trick. Also, we rotate Vmat for easy matching.

haystack <- as.data.frame(M)
haystack$haystack_id <- rownames(haystack)
needle <- as.data.frame(t(Vmat))
needle$needle_id <- rownames(needle)

lookups <- merge(needle, haystack)
lookups <- lookups[order(lookups$needle_id), ]

If we compare this to the string/match solution above, it appears to be faster by a reasonable degree

create_data <- function(haystack.rows, cols, needle.rows) {
   M <- matrix(rpois(haystack.rows * cols, 5), haystack.rows, cols)
   V <- M[sample(1:haystack.rows, needle.rows, replace=T),]
   list(M=M, V=V)
}

> set.seed(1); data <- create_data(1000000, 20, 10000);
> system.time({haystack <- as.data.frame(data$M); haystack$hid <- seq_along(haystack$V1); needle <- as.data.frame(data$V); needle$nid <- seq_along(needle$V1); ret <- merge(needle, haystack); ret <- ret[order(ret$nid),]})
   user  system elapsed
  5.900   0.000   5.906

> system.time({mstr <- apply(data$M, 1, paste0, collapse="_"); vstr <- apply(data$V, 1, paste0, collapse="_"); matchstr <- match(vstr, mstr)})
   user  system elapsed
  8.372   0.000   8.377

match on strings is much faster than merge but you have to pay the cost of transforming the data, whereas converting to a data frame is very cheap, since it uses the same underlying data.

EDIT: added a sort step to the merge version to get the rows in order. Also fixed a typo in the timed version of the merge version. Times remained in the same order of magnitude

EDIT2: Thanks to @Frank, found a bug in the match version of the time, which sped up things substantially (I had been using a local example called asdf which was even larger). Still not as fast as the merge solution, though.

user295691
  • 7,108
  • 1
  • 26
  • 35
  • No need to write your edit notes in the body of the answer. They can be included in the edit summary (a small text field visible when editing). Best to just make your answer into its best version, without documenting its history. – Frank Sep 18 '15 at 15:47
  • 1
    @Frank: obviously a matter of opinion; I like having the edit notes because while the question is being discussed, I don't like it when things have changed that make my analysis of the answer invalid without warning. I could probably make a good case for removing the `EDIT` comments once an answer is accepted. – user295691 Sep 18 '15 at 15:50
  • Fair enough. Might want to add a `identical(ret$hid, matchstr)` to confirm that we're doing the same (presumably right) thing. Pretty standard for benchmarking. – Frank Sep 18 '15 at 15:55
  • This speed is compared with Frank's answer , right? Is there theoretically anything faster than using match? I don't know much about matching algorithms' computational complexity. – wolfsatthedoor Sep 18 '15 at 16:01
  • How is the OP getting reversed results, I wonder. Is their example data substantially different? – Frank Sep 18 '15 at 16:30