
For a longitudinal dataset, I have a binary matrix (df) indicating whether data is available for cases (rows) across time points (columns). I would like to find the optimal subset of rows and columns such that at least 2/3 of the entries in each remaining row and column equal 1. The issue I run into is that the two selections depend on one another (i.e., the set of columns for which at least 2/3 of the rows have data available changes as soon as I remove a row with less than 2/3 data available, and vice versa).

# data structure example:
library(data.table)
set.seed(42)
df <- as.data.table(matrix(rbinom(10 * 5, 1, .66), ncol = 10, nrow = 5))

df
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1:  0  1  1  0  0  1  0  0  1   0
2:  0  0  0  0  1  1  0  1  1   0
3:  1  1  0  1  0  0  1  1  1   1
4:  0  1  1  1  0  1  0  0  0   0
5:  1  0  1  1  1  0  1  1  1   1
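
To make the interdependence concrete, the threshold check on the full matrix is simply (nothing new here, just the 2/3 criterion spelled out):

rowMeans(df) >= 2/3   # rows with at least 2/3 data available
colMeans(df) >= 2/3   # columns with at least 2/3 data available

but dropping any failing row changes the column means, and dropping any failing column changes the row means, which is the circularity described above.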

Judging by the tags of similar questions, this seems to be an integer programming problem or a multi-objective optimization problem. Unfortunately, I am not very familiar with either approach. Intuitively, I would like to simultaneously maximize rowMeans and colMeans subject to a >= .66 constraint on both, but I am not sure whether that is really the most productive framing here.
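
For reference, here is a sketch of how I understand that intuition could be written as a 0/1 integer program (this uses the ompr/ompr.roi packages with a GLPK solver; I have not verified that it runs in reasonable time on the full 71 x 155 matrix). Binary variables x, y, z mark kept rows, kept columns, and kept cells:

library(magrittr)
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)

A <- as.matrix(df)
n <- nrow(A)
m <- ncol(A)

model <- MIPModel() %>%
  add_variable(x[i], i = 1:n, type = "binary") %>%             # keep row i?
  add_variable(y[j], j = 1:m, type = "binary") %>%             # keep column j?
  add_variable(z[i, j], i = 1:n, j = 1:m, type = "binary") %>% # keep cell (i, j)?
  # z[i, j] = x[i] * y[j], linearized:
  add_constraint(z[i, j] <= x[i], i = 1:n, j = 1:m) %>%
  add_constraint(z[i, j] <= y[j], i = 1:n, j = 1:m) %>%
  add_constraint(z[i, j] >= x[i] + y[j] - 1, i = 1:n, j = 1:m) %>%
  # every kept row needs >= 2/3 ones among the kept columns
  # (if x[i] = 0, all z[i, j] are 0 and the constraint holds trivially)
  add_constraint(sum_over(A[i, j] * z[i, j], j = 1:m) >=
                   (2/3) * sum_over(z[i, j], j = 1:m), i = 1:n) %>%
  # every kept column needs >= 2/3 ones among the kept rows
  add_constraint(sum_over(A[i, j] * z[i, j], i = 1:n) >=
                   (2/3) * sum_over(z[i, j], i = 1:n), j = 1:m) %>%
  # objective: keep as many available (== 1) cells as possible
  set_objective(sum_over(A[i, j] * z[i, j], i = 1:n, j = 1:m), "max")

result <- solve_model(model, with_ROI(solver = "glpk"))
keep_rows <- which(get_solution(result, x[i])$value > 0.5)
keep_cols <- which(get_solution(result, y[j])$value > 0.5)

The z variables grow with nrow * ncol (about 11,000 binaries at 71 x 155), so this states both constraints at once but may still be slow with an open-source solver.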

So far, I have tried to adapt the approaches from similar questions. With a brute-force approach based on this thread, I was able to find optimal row and column subsets separately:

best.list.row.df = list()
for (i in 1:nrow(df)) {
  # get best subset for rows based on how many columns have more than 66% data
  rowlist = combn(nrow(df), i)
  numobs = apply(rowlist, 2, function(x) sum(colMeans(df[x,])*100 >= 66))
  cat("For subsets of", i, "rows, the highest number of observations is", max(numobs), "out of the", ncol(df), "maximum. Product =", i*max(numobs),"\n")
  best = which(numobs == max(numobs))[1]
  best.list.row.df = c(best.list.row.df, list(rowlist[, best]))
}
> For subsets of 1 rows, the highest number of observations is 8 out of the 10 maximum. Product = 8 
> For subsets of 2 rows, the highest number of observations is 6 out of the 10 maximum. Product = 12 
> For subsets of 3 rows, the highest number of observations is 8 out of the 10 maximum. Product = 24 
> For subsets of 4 rows, the highest number of observations is 4 out of the 10 maximum. Product = 16 
> For subsets of 5 rows, the highest number of observations is 1 out of the 10 maximum. Product = 5 

best.list.col.df = list()
for (i in 1:ncol(df)) {
  # get best subset for columns based on how many rows have more than 66% data
  collist = combn(ncol(df), i)
  numobs = apply(collist, 2, function(x) sum(rowMeans(df[, ..x])*100 >= 66))
  cat("For subsets of", i, "columns, the highest number of participants is", max(numobs), "out of the", nrow(df), "maximum. Product =", i*max(numobs),"\n")
  best = which(numobs == max(numobs))[1]
  best.list.col.df = c(best.list.col.df, list(collist[, best]))
}
> For subsets of 1 columns, the highest number of participants is 4 out of the 5 maximum. Product = 4 
> For subsets of 2 columns, the highest number of participants is 3 out of the 5 maximum. Product = 6 
> For subsets of 3 columns, the highest number of participants is 5 out of the 5 maximum. Product = 15 
> For subsets of 4 columns, the highest number of participants is 4 out of the 5 maximum. Product = 16 
> For subsets of 5 columns, the highest number of participants is 2 out of the 5 maximum. Product = 10 
> For subsets of 6 columns, the highest number of participants is 4 out of the 5 maximum. Product = 24 
> For subsets of 7 columns, the highest number of participants is 2 out of the 5 maximum. Product = 14 
> For subsets of 8 columns, the highest number of participants is 2 out of the 5 maximum. Product = 16 
> For subsets of 9 columns, the highest number of participants is 2 out of the 5 maximum. Product = 18 
> For subsets of 10 columns, the highest number of participants is 2 out of the 5 maximum. Product = 20 

Based on these results, I would select the three-row and six-column solutions, as those individually give me the most valid data (the largest product of subset size and rows/columns passing the threshold).
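
For illustration, combining those two independently chosen subsets looks like this, but because they were selected separately, their intersection is not guaranteed to satisfy both thresholds at once:

rows_keep <- best.list.row.df[[3]]   # best 3-row subset
cols_keep <- best.list.col.df[[6]]   # best 6-column subset
sub <- df[rows_keep, ..cols_keep]
rowMeans(sub) >= 2/3                 # may fail: rows were picked against all columns
colMeans(sub) >= 2/3                 # may fail: columns were picked against all rows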

The problems with this approach are: (1) combn() becomes computationally infeasible for my larger data frames (up to 71 x 155), because the number of subsets to enumerate explodes combinatorially; and (2) it still does not address the two "optimizations" at the same time.
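
One scalable workaround I can think of is a greedy pruning heuristic: repeatedly drop whichever remaining row or column has the worst coverage until every row and column meets the threshold. A sketch (this is only a heuristic; it always terminates because each pass removes one row or column, but it is not guaranteed to find the globally optimal subset):

prune_greedy <- function(A, threshold = 2/3) {
  A <- as.matrix(A)
  rownames(A) <- seq_len(nrow(A))   # remember original row indices
  colnames(A) <- seq_len(ncol(A))   # remember original column indices
  repeat {
    if (nrow(A) == 0 || ncol(A) == 0) break
    rmeans <- rowMeans(A)
    cmeans <- colMeans(A)
    if (all(rmeans >= threshold) && all(cmeans >= threshold)) break
    # drop whichever single row or column currently has the worst coverage
    if (min(rmeans) <= min(cmeans)) {
      A <- A[-which.min(rmeans), , drop = FALSE]
    } else {
      A <- A[, -which.min(cmeans), drop = FALSE]
    }
  }
  A
}

sub <- prune_greedy(df)
kept_rows <- as.integer(rownames(sub))
kept_cols <- as.integer(colnames(sub))

This runs instantly even at 71 x 155, though it only explores a single pruning path.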

Other potentially related question:
How to optimize intersect of rows and columns in a matrix?

I really hope I was able to describe my aims adequately here. Any suggestions or thoughts would be greatly appreciated. Thank you in advance :)

  • You will need to define what an 'optimal subset' is. The missingness, as you describe it, is a constraint for a valid subset, but you will need an objective to optimize. Perhaps this answer is helpful https://stackoverflow.com/questions/64110899/maximum-determinant-of-sub-matrix/64392200#64392200 – Enrico Schumann Jun 15 '22 at 08:48
  • Thank you @EnricoSchumann. I have added an illustration of what I mean by 'optimal subset': basically, finding the rows and columns for which I have the most valid data. If there are several solutions with equal performance, I would probably choose the one with more rows and/or with fewer consecutive zeros per row. I have also tried your suggested TA approach but have not been able to get it to work for my non-square matrices. But I really like the approach. – JannisOverflows Jun 16 '22 at 11:15
