`data.table` is fast for relational tables, but a basic matrix is really fast. My suggestion is to store the tables as matrices and use the simpler matrix subsetting to compare sub-matrices.
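As a minimal sketch of that suggestion (the names `df`, `m`, and `sub` are just examples, not from the question), an all-numeric table converts to a matrix with `as.matrix()`, after which sub-matrix extraction is a single subsetting call:

```r
# Hypothetical example: an all-numeric data.frame converts cleanly to a matrix.
df <- data.frame(a = 1:5, b = 6:10)
m <- as.matrix(df)

# Extract rows 2-4, all columns; drop = FALSE keeps the result a matrix.
sub <- m[2:4, , drop = FALSE]
dim(sub)
# [1] 3 2
```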
Starting with some example data: `bigmat` is the large matrix we'll look for matches inside, `smallmat_in` is a submatrix of `bigmat`, and `smallmat_out` is a matrix that's not inside `bigmat`.
bigmat <- matrix(c(1:50, 1:50), nrow = 10)
smallmat_in <- bigmat[6:8, 2:3]
smallmat_out <- smallmat_in
smallmat_out[6] <- 0
bigmat
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 1 11 21 31 41 1 11 21 31 41
# [2,] 2 12 22 32 42 2 12 22 32 42
# [3,] 3 13 23 33 43 3 13 23 33 43
# [4,] 4 14 24 34 44 4 14 24 34 44
# [5,] 5 15 25 35 45 5 15 25 35 45
# [6,] 6 16 26 36 46 6 16 26 36 46
# [7,] 7 17 27 37 47 7 17 27 37 47
# [8,] 8 18 28 38 48 8 18 28 38 48
# [9,] 9 19 29 39 49 9 19 29 39 49
# [10,] 10 20 30 40 50 10 20 30 40 50
smallmat_in
# [,1] [,2]
# [1,] 16 26
# [2,] 17 27
# [3,] 18 28
smallmat_out
# [,1] [,2]
# [1,] 16 26
# [2,] 17 27
# [3,] 18 0
Instead of trying to iterate over every possible 3x2 submatrix of `bigmat`, we can quickly find which elements of `bigmat` could be the top-left corner of a matching submatrix.
index_matching_first <- function(small, big) {
  # A top-left corner must leave room for all of small below and to the right
  max_big_row <- nrow(big) - nrow(small) + 1
  max_big_col <- ncol(big) - ncol(small) + 1
  valid_rows <- seq_len(max_big_row)
  valid_cols <- seq_len(max_big_col)
  # drop = FALSE keeps the subset a matrix even when only one row or column is valid
  which(big[valid_rows, valid_cols, drop = FALSE] == small[[1]], arr.ind = TRUE)
}
index_matching_first(smallmat_in, bigmat)
# row col
# [1,] 6 2
# [2,] 6 7
index_matching_first(smallmat_out, bigmat)
# row col
# [1,] 6 2
# [2,] 6 7
`smallmat_in` and `smallmat_out` only differ in their last elements, so they have the same matches for their first elements. Next, we'll define a function that takes a small matrix (`small`), a large matrix (`big`), and a row-column pair (`big_first_index`). If the row-column pair is the top-left corner of a submatrix of `big` that matches `small`, it returns `TRUE`; otherwise, `FALSE`.
is_matrix_match <- function(small, big, big_first_index) {
  # The rows and columns of big covered by a small-sized block at big_first_index
  row_indices <- seq(big_first_index[1], by = 1, length.out = nrow(small))
  col_indices <- seq(big_first_index[2], by = 1, length.out = ncol(small))
  all(small == big[row_indices, col_indices, drop = FALSE])
}
is_matrix_match(smallmat_in, bigmat, c(6, 2))
# [1] TRUE
is_matrix_match(smallmat_out, bigmat, c(6, 2))
# [1] FALSE
So this works when we feed it a row-column pair. We can now apply this function row by row over the output of `index_matching_first()` and see if any matches are found.
in_matrix <- function(small, big) {
first_matches <- index_matching_first(small, big)
is_same <- apply(
first_matches,
MARGIN = 1,
FUN = is_matrix_match,
small = small,
big = big
)
any(is_same)
}
in_matrix(smallmat_in, bigmat)
# [1] TRUE
in_matrix(smallmat_out, bigmat)
# [1] FALSE
Because this is a proof-of-concept, it doesn't have any checks (like making sure `big` is actually bigger than `small`). Those would be nice to have in production settings.
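For instance, input validation could be sketched like this (the helper name `check_submatrix_args` is mine, not part of the solution above):

```r
# Hypothetical validation helper: fail fast on inputs the search can't handle,
# e.g. non-matrix arguments or a "small" matrix larger than "big".
check_submatrix_args <- function(small, big) {
  stopifnot(
    is.matrix(small), is.matrix(big),
    nrow(small) >= 1, ncol(small) >= 1,
    nrow(small) <= nrow(big),
    ncol(small) <= ncol(big)
  )
  invisible(TRUE)
}
```

A production version of `in_matrix()` could simply call this first and proceed unchanged.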
I don't know how large your matrices are, but here are some of my speed measurements for a larger matrix:
hugemat <- matrix(rep_len(1:7, 1e7), nrow = 10)
format(object.size(hugemat), "MB")
# [1] "38.1 Mb"
huge_submat <- hugemat[2:9, 200:300]
huge_not_submat <- huge_submat
huge_not_submat[] <- 1
system.time(in_matrix(huge_submat, hugemat))
# user system elapsed
# 10.51 0.00 10.53
system.time(in_matrix(huge_not_submat, hugemat))
# user system elapsed
# 10.62 0.00 10.69