1

My problem is the following:

I am looking for the fastest way to find the column and row index of a 2D table inside a larger table. The small table is of size n x m and can occur multiple times in the larger table of j x k with j>n and k>m. I have been trying to do that with the 'data.table' package but failed. My problem is very similar to the following one: Matlab how to find out if a matrix is in another martrix

But I am looking to do this fast in R. I would like to have your views before implementing a brute force approach with for loops. Please note that in the table you may have numbers and strings.

If you need an example. You can consider that I need to search the following data.frame:

data.frame(A=c(1.7,1.5,1.7),B=c(0.3,0.3,0.2),C=c("setosa","setosa","setosa"))

that you have to search in the 'iris' data.frame. The output answer would be: row 19 and column 3.

Frank
  • 66,179
  • 8
  • 96
  • 180
momo123
  • 93
  • 7
  • 1
    take a look at https://stackoverflow.com/questions/31330196/find-a-submatrix-in-a-matrix and if it's what you're looking for i'll mark this a dup but you shld not delete it as it's asked in a different way that might help others find it. – hrbrmstr Oct 30 '17 at 18:41
  • @hrbrmstr, thank you for your reply. The problem is somehow similiar but not sure if it is the same. In the link you provided, 2 answers don't really work or are incomplete and the Rcpp answer returns me errors. I tried to debug it but without success. – momo123 Oct 30 '17 at 21:29

3 Answers3

0

Based on this answer:

library(dplyr)
# Problems with factors when checking equality so strings instead
df1 <- data.frame(A=c(1.7,1.5,1.7),
                  B=c(0.3,0.3,0.2),
                  C=c("setosa","setosa","setosa")) %>%
  dplyr::mutate_if(is.factor, as.character)
df <- iris %>%
  dplyr::mutate_if(is.factor, as.character)


# find all matches of the top left corner of df1
hits <- which(df == df1[1,1], arr.ind=TRUE)
# remove those matches that can't logically fit in the data
hits <- hits[hits[,"row"] <= nrow(df)-nrow(df1)+1 &
               hits[,"col"] <= ncol(df)-ncol(df1)+1, , drop=FALSE]


for (j in seq_len(ncol(df1))) {
  for (i in seq_len(nrow(df1))) {
    if (i == 1 && j == 1) next
    hits <- hits[df[sweep(hits, 2, c(i, j) - 1, '+')] == df1[i, j], , drop = FALSE]
  }
}

This should be fast and simple.

F. Privé
  • 11,423
  • 2
  • 27
  • 78
  • Thank you for your reply. I find it relatively slow on large tables, however I will accept your answer. I will try to implement something faster based on what you provided. – momo123 Oct 31 '17 at 09:09
0

data.table is fast for relational tables, but a basic matrix is really fast. My suggestion is to store the tables as matrices and use the simpler subsetting to compare sub-matrices.

Starting with some example data: bigmat is the large matrix we'll look for matches inside, smallmat_in is a submatrix of bigmat, and smallmat_out is a matrix that's not inside bigmat.

bigmat <- matrix(c(1:50, 1:50), nrow = 10)
smallmat_in <- bigmat[6:8, 2:3]
smallmat_out <- smallmat_in
smallmat_out[6] <- 0

bigmat
#       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#  [1,]    1   11   21   31   41    1   11   21   31    41
#  [2,]    2   12   22   32   42    2   12   22   32    42
#  [3,]    3   13   23   33   43    3   13   23   33    43
#  [4,]    4   14   24   34   44    4   14   24   34    44
#  [5,]    5   15   25   35   45    5   15   25   35    45
#  [6,]    6   16   26   36   46    6   16   26   36    46
#  [7,]    7   17   27   37   47    7   17   27   37    47
#  [8,]    8   18   28   38   48    8   18   28   38    48
#  [9,]    9   19   29   39   49    9   19   29   39    49
# [10,]   10   20   30   40   50   10   20   30   40    50

smallmat_in
#      [,1] [,2]
# [1,]   16   26
# [2,]   17   27
# [3,]   18   28

smallmat_out
#      [,1] [,2]
# [1,]   16   26
# [2,]   17   27
# [3,]   18    0

Instead of trying to iterate over every possible 3x2 submatrix of bigmat, we can quickly find which elements of bigmat can be a top-left corner of a matching submatrix.

index_matching_first <- function(small, big) {
  max_big_row <- nrow(big) - nrow(small) + 1
  max_big_col <- ncol(big) - ncol(small) + 1
  valid_rows <- seq_len(max_big_row)
  valid_cols <- seq_len(max_big_col)
  which(big[valid_rows, valid_cols] == small[[1]], arr.ind = TRUE)
}

index_matching_first(smallmat_in, bigmat)
#      row col
# [1,]   6   2
# [2,]   6   7

index_matching_first(smallmat_out, bigmat)
#      row col
# [1,]   6   2
# [2,]   6   7

smallmat_in and smallmat_out only differ in their last elements, so they have the same matches for their first elements. Next, we'll define a function that takes a small matrix (small), a large matrix (big), and a row-column pair (big_first_index). If the row-column pair is the top-left corner of a submatrix of big that matches small, return TRUE. Otherwise, FALSE.

is_matrix_match <- function(small, big, big_first_index) {
  row_indices <- seq(big_first_index[1], by = 1, length.out = nrow(small))
  col_indices <- seq(big_first_index[2], by = 1, length.out = ncol(small))
  all(small == big[row_indices, col_indices])
}

is_matrix_match(smallmat_in, bigmat, c(6, 2))
# [1] TRUE

is_matrix_match(smallmat_out, bigmat, c(6, 2))
# [1] FALSE

So this works when we feed it a row-column pair. We can now apply this function iteratively over the output of index_matching_first(...) and see if any matches are found.

in_matrix <- function(small, big) {
  first_matches <- index_matching_first(small, big)
  is_same <- apply(
    first_matches,
    MARGIN = 1,
    FUN    = is_matrix_match,
    small  = small,
    big    = big
  )
  any(is_same)
}

in_matrix(smallmat_in, bigmat)
# [1] TRUE
in_matrix(smallmat_out, bigmat)
# [1] FALSE

Because this is a proof-of-concept, it doesn't have any checks (like making sure big is actually bigger than small). Those would be nice to have in production settings.

I don't know how large of matrices you're working with, but here are some of my speed measurements for a larger matrix:

hugemat <- matrix(rep_len(1:7, 1e7), nrow = 10)
format(object.size(hugemat), "MB")
# [1] "38.1 Mb"

huge_submat <- hugemat[2:9, 200:300]
huge_not_submat <- huge_submat
huge_not_submat[] <- 1

system.time(in_matrix(huge_submat, hugemat))
#  user  system elapsed
# 10.51    0.00   10.53

system.time(in_matrix(huge_not_submat, hugemat))
#  user  system elapsed
# 10.62    0.00   10.69
Nathan Werth
  • 5,093
  • 18
  • 25
  • To stay with matrix might be good idea but how do you handle OP's requirement: *Please note that in the table you may have numbers and strings.*? – Uwe Nov 03 '17 at 09:43
  • @Uwe My solution doesn't rely on numeric comparisons, so it would also work with character matrices. Any numeric data can be converted to character. – Nathan Werth Nov 03 '17 at 12:55
  • @NathanWerth. Thank you very much for your contribution. I personally think your function is interesting but is still too slow for me. I believe It is possible to divide the computation time by 10 at least. I haven't done it yet... – momo123 Nov 03 '17 at 15:12
0

The function fpos in package kit returns you the position of a small matrix inside a larger one. It also works with vectors. Convert your data.frame to a matrix and then use fpos. Note that the function is implemented in C so it should be fast.

Suresh_Patel
  • 291
  • 3
  • 5