Consider the following dataset:

set.seed(50)
d = matrix(rbinom(1000, 1, 0.9), ncol = 20)

Each row corresponds to an object, and each column corresponds to a measurement of that object. For example, rows could be individuals in a study and columns could be repeated measures through time. In this case, the measurements are binary (0/1, read as TRUE/FALSE) indicating presence or absence.

I am looking for an algorithm that will allow me to identify the maximal collection of rows that have n coincident observations. In other words, I'm looking for a way to filter for the rows that all have n TRUE values in the same columns. A member of the group can have more than n TRUE values though.
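To make the target concrete: for one fixed set of columns, the matching group is just the rows that are TRUE in all of them (cols below is a hypothetical subset; the hard part is finding the right subsets and their maximal row groups):

cols = c(1, 5, 9)                                 # hypothetical column subset
group = which(rowSums(d[, cols]) == length(cols)) # rows TRUE in every column of cols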

The trivial example: the rows with 20 (all) TRUE values are captured by

which(apply(d, 1, all))

which identifies rows 3, 10, 12, 24, 36, 39, 48, 50. Similarly, it's easy to find all the unique series and group the rows that share exactly the same observations:

unique.series = d[!duplicated(d), ]   # one representative of each distinct row
groups = vector("list", nrow(unique.series))
for (i in seq_along(groups))
  groups[[i]] = which(apply(d, 1, function(x)
    identical(x, unique.series[i, ])))

But what if I want all groups with 19 or more coincident observations? For example, groups 3 (rows 3, 10, 12, 24, 36, 39, 48, 50) and 21 (rows 23, 32, 40) differ only in observation 9 (group 3 has an observation there, but group 21 does not). How can I programmatically identify series that partially match, i.e. that contain some subset of matching observations? This seems like a subsequence-matching problem, but it's a bit more abstract because the subsequences don't need to be contiguous.
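For scale, a brute-force baseline is possible (a sketch; groups.n is just an illustrative name): enumerate every set of n columns and keep the rows that are TRUE in all of them.

n = 19
groups.n = combn(ncol(d), n, function(cols)
  which(rowSums(d[, cols]) == n), simplify = FALSE)
groups.n = unique(Filter(function(g) length(g) > 1, groups.n)) # drop singletons and duplicates

This only stays feasible because choose(20, 19) = 20; the number of column subsets explodes for mid-range n, hence my search for a real algorithm.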

One way might be to use regular expressions, but I can't get it to work right:

unique.strings = lapply(
  apply(unique.series, 1, function(x)
    which(as.logical(x))),
  paste,
  collapse = ","
)
reg.strings = paste0("^", lapply(
  apply(unique.series, 1, function(x)
    sprintf("(%d)", which(as.logical(x)))),
  paste, collapse = "+(,[0-9],)*"), "$")
lapply(unique.strings, grep, x = unique.strings) # NOT CORRECT

I would appreciate any alternative algorithms, regex-based or otherwise.


1 Answer


This isn't a complete answer, but I did get more than halfway there. I abandoned the regex approach and instead adopted a binary matrix approach.

A collection of coincident observations will appear in the occurrence matrix as a block of TRUE values. I don't require that observations be consecutive, i.e. the row/column ordering of the matrix doesn't matter. Therefore, I can simply rearrange my matrix so that occurrences are grouped into blocks, and then use block-detection or clustering algorithms to extract sets of observations. There are two components to this procedure: first, rearrange the matrix to make it as "blocky" as possible by row/column swapping; second, identify the blocks in the rearranged matrix.

The arrangement part is actually really easy. I used the seriation package to rearrange the matrix into blocks.

set.seed(50)
d = matrix(as.logical(rbinom(50, 1, 0.75)), ncol = 10)
rownames(d) = LETTERS[1:5]   # individual IDs
colnames(d) = month.abb[1:10]

library(seriation)
o = seriate(d)
d.a = permute(d, o)            # rearranged matrix
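To eyeball how blocky the rearrangement is, seriation's pimage plots a (permuted) matrix directly:

pimage(d.a)   # visual check: TRUE cells should now cluster into blocks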

I haven't decided on a good approach for the block detection, but there are several SO questions that deal with largest-block detection (e.g. 1, 2). I am hoping I can adapt these algorithms to find the largest block of width n or something like that; a rough sketch of that idea follows. I will update this answer if I find a good solution.
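As a first stab at the width-n idea (tallest_block is a hypothetical helper, not a finished detector): slide a window of n contiguous columns across the permuted matrix and keep the window with the most all-TRUE rows.

tallest_block = function(m, n) {
  best = list(rows = integer(0), cols = integer(0))
  for (j in seq_len(ncol(m) - n + 1)) {
    cols = j:(j + n - 1)                                # window of n contiguous columns
    rows = which(rowSums(m[, cols, drop = FALSE]) == n) # rows TRUE across the whole window
    if (length(rows) > length(best$rows))
      best = list(rows = rows, cols = cols)
  }
  best
}

b = tallest_block(d.a, 3)
rownames(d.a)[b$rows]   # individuals in the block
colnames(d.a)[b$cols]   # the shared measurements (in permuted order)

Note this only finds contiguous blocks, which is exactly why the seriation step matters; cols indexes the permuted matrix, and the dimnames recover the original labels.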
