For a longitudinal dataset, I have a binary matrix (df) indicating whether data are available for each case (rows) at each time point (columns). I would like to find the optimal subset in which at least 2/3 of the entries in every remaining row and column equal 1. The issue I run into is that the two criteria depend on one another: the set of columns in which at least 2/3 of the rows have data available changes as soon as I remove a row that has less than 2/3 of its data available, and vice versa.
# data structure example:
library(data.table)
set.seed(42)
df <- as.data.table(matrix(rbinom(10*5, 1, .66), ncol = 10, nrow = 5))
df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1: 0 1 1 0 0 1 0 0 1 0
2: 0 0 0 0 1 1 0 1 1 0
3: 1 1 0 1 0 0 1 1 1 1
4: 0 1 1 1 0 1 0 0 0 0
5: 1 0 1 1 1 0 1 1 1 1
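For concreteness, the 2/3 criterion can be checked directly on the example (using the plain matrix rather than the data.table, so this chunk stands on its own):

```r
# Rebuild the example as a plain matrix and check the 2/3 criterion per margin.
set.seed(42)
m <- matrix(rbinom(10 * 5, 1, .66), ncol = 10, nrow = 5)

rowMeans(m) >= 2/3  # cases with data at >= 2/3 of time points: rows 3 and 5
colMeans(m) >= 2/3  # time points with data for >= 2/3 of cases: only V9
```

So on the full matrix only two rows and one column pass, but removing the failing rows changes which columns pass, which is exactly the circularity described above.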
From the tags of similar questions, this seems like an integer programming problem or a multi-objective optimization problem. Unfortunately, I am not very familiar with either approach. Intuitively, I would like to simultaneously maximize rowMeans and colMeans subject to a >= .66 constraint — but I'm not sure whether that is really the most productive way to frame it.
So far, I have tried to adapt the approaches of similar questions. With a brute-force approach based on this thread, I was able to find optimal row and column subsets separately:
best.list.row.df = list()
for (i in 1:nrow(df)) {
  # score each row subset of size i by how many columns then have >= 66% data
  rowlist = combn(nrow(df), i)
  numobs = apply(rowlist, 2, function(x) sum(colMeans(df[x, ]) * 100 >= 66))
  cat("For subsets of", i, "rows, the highest number of observations is", max(numobs),
      "out of the", ncol(df), "maximum. Product =", i * max(numobs), "\n")
  best = which(numobs == max(numobs))[1]
  best.list.row.df = c(best.list.row.df, list(rowlist[, best]))
}
> For subsets of 1 rows, the highest number of observations is 8 out of the 10 maximum. Product = 8
> For subsets of 2 rows, the highest number of observations is 6 out of the 10 maximum. Product = 12
> For subsets of 3 rows, the highest number of observations is 8 out of the 10 maximum. Product = 24
> For subsets of 4 rows, the highest number of observations is 4 out of the 10 maximum. Product = 16
> For subsets of 5 rows, the highest number of observations is 1 out of the 10 maximum. Product = 5
best.list.col.df = list()
for (i in 1:ncol(df)) {
  # score each column subset of size i by how many rows then have >= 66% data
  collist = combn(ncol(df), i)
  numobs = apply(collist, 2, function(x) sum(rowMeans(df[, ..x]) * 100 >= 66))
  cat("For subsets of", i, "columns, the highest number of participants is", max(numobs),
      "out of the", nrow(df), "maximum. Product =", i * max(numobs), "\n")
  best = which(numobs == max(numobs))[1]
  best.list.col.df = c(best.list.col.df, list(collist[, best]))
}
> For subsets of 1 columns, the highest number of participants is 4 out of the 5 maximum. Product = 4
> For subsets of 2 columns, the highest number of participants is 3 out of the 5 maximum. Product = 6
> For subsets of 3 columns, the highest number of participants is 5 out of the 5 maximum. Product = 15
> For subsets of 4 columns, the highest number of participants is 4 out of the 5 maximum. Product = 16
> For subsets of 5 columns, the highest number of participants is 2 out of the 5 maximum. Product = 10
> For subsets of 6 columns, the highest number of participants is 4 out of the 5 maximum. Product = 24
> For subsets of 7 columns, the highest number of participants is 2 out of the 5 maximum. Product = 14
> For subsets of 8 columns, the highest number of participants is 2 out of the 5 maximum. Product = 16
> For subsets of 9 columns, the highest number of participants is 2 out of the 5 maximum. Product = 18
> For subsets of 10 columns, the highest number of participants is 2 out of the 5 maximum. Product = 20
Based on these results, I would select the solution of three rows and six columns, as those subsets individually give me the most valid data.
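As a sanity check on the toy example, both dimensions can also be brute-forced jointly — 2^5 * (2^10 - 1) subset pairs, hopeless for my real data, but it at least defines what a "simultaneous" optimum would mean. A sketch maximizing the number of retained cells:

```r
# Joint brute force: test every (row subset, column subset) pair against the
# 2/3 criterion and keep the feasible pair retaining the most cells.
set.seed(42)
m <- matrix(rbinom(10 * 5, 1, .66), ncol = 10, nrow = 5)

best <- list(rows = integer(0), cols = integer(0), cells = 0)
for (ri in 1:(2^nrow(m) - 1)) {
  rows <- which(bitwAnd(ri, 2^(seq_len(nrow(m)) - 1)) > 0)
  for (ci in 1:(2^ncol(m) - 1)) {
    cols <- which(bitwAnd(ci, 2^(seq_len(ncol(m)) - 1)) > 0)
    sub  <- m[rows, cols, drop = FALSE]
    if (all(rowMeans(sub) >= 2/3) && all(colMeans(sub) >= 2/3)) {
      cells <- length(rows) * length(cols)
      if (cells > best$cells) {
        best <- list(rows = rows, cols = cols, cells = cells)
      }
    }
  }
}
best
```

Interestingly, the jointly optimal block need not coincide with the row and column subsets that win the two separate searches, which is exactly my problem (2) below.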
The problems with this approach are: (1) the combn() enumeration becomes computationally infeasible for my larger data frames (up to 71 x 155), and (2) it still does not address the two "optimizations" at the same time.
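The only scalable idea I have had so far is a greedy heuristic: repeatedly drop whichever row or column of the current submatrix has the lowest mean, until every remaining row and column meets the threshold. It is fast even for 71 x 155, but it comes with no optimality guarantee, so I would welcome something better:

```r
# Greedy heuristic (no optimality guarantee): while any row or column of the
# current submatrix is below the threshold, drop the single worst offender.
prune_worst <- function(m, thr = 2/3) {
  rows <- seq_len(nrow(m))
  cols <- seq_len(ncol(m))
  while (length(rows) > 0 && length(cols) > 0) {
    sub <- m[rows, cols, drop = FALSE]
    rm_ <- rowMeans(sub)
    cm_ <- colMeans(sub)
    if (all(rm_ >= thr) && all(cm_ >= thr)) break
    if (min(rm_) <= min(cm_)) {
      rows <- rows[-which.min(rm_)]  # worst row is worse (or tied): drop it
    } else {
      cols <- cols[-which.min(cm_)]  # worst column is worse: drop it
    }
  }
  list(rows = rows, cols = cols)
}

set.seed(42)
m <- matrix(rbinom(10 * 5, 1, .66), ncol = 10, nrow = 5)
res <- prune_worst(m)
```

The tie-breaking rule (drop a row when the minima are equal, always the first worst index) is arbitrary, and I suspect the result depends on it.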
Other potentially related question:
How to optimize intersect of rows and columns in a matrix?
I really hope I was able to describe my aims adequately here. Any suggestions or thoughts would be greatly appreciated. Thank you in advance :)