Consider the following dataset:
set.seed(50)
d = matrix(rbinom(1000, 1, 0.9), ncol = 20)
Each row corresponds to an object, and each column corresponds to a measurement of the object. For example, rows could be individuals in a study and columns could be repeated measures through time. In this case, measurements are TRUE/FALSE indicating presence or absence of an object.
I am looking for an algorithm that will allow me to identify the
maximal collection of rows that have n
coincident observations.
In other words, I'm looking for a way to filter for the rows that all
have n
TRUE values in the same columns. A member of the group can
have more than n
TRUE values though.
The trivial example: the rows with 20 (all) TRUE values are captured by
which(apply(d, 1, all))
which identifies rows 3, 10, 12, 24, 36, 39, 48, 50
. Similarly, it's
easy to identify all the unique sequences and identify groups that share
the same observations:
unique.series = d[which(!duplicated(d)),]
groups = vector("list", nrow(unique.series))
for(i in seq_along(groups))
groups[[i]] = which(apply(d, 1, function(x)
identical(x, unique.series[i,])))
But what if I want all groups with 19 or more observations? For example,
groups 3
(rows 3, 10, 12, 24, 36, 39, 48, 50
) and 21
(rows 23, 32, 40
) only differ by observation 9
(group 3 has an observation, but group 21
does not). How can I
programmatically identify series that partially match, i.e. contain some
subset of matching observations? This seems like a subsequence matching
problem, but it's a bit more abstract because the subsequences don't need to be continuous.
One way might be to use regular expressions, but I can't get it to work right:
unique.strings = lapply(
apply(unique.series, 1, function(x)
which(as.logical(x))),
paste,
collapse = ","
)
reg.strings = paste0("^", lapply(
apply(unique.series, 1, function(x)
sprintf("(%d)", which(as.logical(x)))),
paste, collapse = "+(,[0-9],)*"), "$")
lapply(unique.strings, grep, x = unique.strings) # NOT CORRECT
I would appreciate any alternative algorithms, regex-based or other.