Here's a way using mixed integer programming (MIP). I have used ompr
package for mathematical modeling and open source "glpk" solver. I have added model explanation as comments in the code. MIP approaches, when successful, guarantee optimal solution as indicated by solver_status(model)
shown in code.
This approach will easily scale up to handle large matrices.
library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)
set.seed(1)
mat <- matrix(sample(x = c(1:10, NA), size = 37 * 21, replace = TRUE), ncol = 21)
# filtering all rows with NA retains 126 cells
prod(dim(mat[apply(mat, 1, function(x) all(!is.na(x))), , drop = F]))
# [1] 126
# filtering all cols with NA retains 37 cells
prod(dim(mat[, apply(mat, 2, function(x) all(!is.na(x))), drop = F]))
# [1] 37
m <- +!is.na(mat) # gets logical matrix; 0 if NA else 1
nr <- nrow(m)
nc <- ncol(m)
model <- MIPModel() %>%
# keep[i,j] is 1 if matrix cell [i,j] is to be kept else 0
add_variable(keep[i,j], i = 1:nr, j = 1:nc, typ = "binary") %>%
# rm_row[i] is 1 if row i is selected for removal else 0
add_variable(rm_row[i], i = 1:nr, type = "binary") %>%
# rm_col[j] is 1 if column j is selected for removal else 0
add_variable(rm_col[j], j = 1:nc, type = "binary") %>%
# maximize good cells kept
set_objective(sum_expr(keep[i,j], i = 1:nr, j = 1:nc), "max") %>%
# cell can be kept only when row is not selected for removal
add_constraint(sum_expr(keep[i,j], j = 1:nc) <= 1 - rm_row[i], i = 1:nr) %>%
# cell can be kept only when column is not selected for removal
add_constraint(sum_expr(keep[i,j], i = 1:nr) <= 1 - rm_col[j], j = 1:nc) %>%
# only non-NA values can be kept
add_constraint(m[i,j] + rm_row[i] + rm_col[j] >= 1, i = 1:nr, j = 1:nc) %>%
# solve using free glpk solver
solve_model(with_ROI(solver = "glpk"))
Get solution -
solver_status(model)
# [1] "optimal" <- "optimal" guarnatees optimality
# get rows to remove
rm_rows <- model %>%
get_solution(rm_row[i]) %>%
filter(value > 0) %>%
pull(i)
# [1] 1 3 4 6 7 8 10 14 18 19 20 21 22 23 24 28 30 33 34 35 37
# get columns to remove
rm_cols <- model %>%
get_solution(rm_col[j]) %>%
filter(value > 0) %>%
pull(j)
# [1] 6 14 15 16 17
result <- mat[-rm_rows, -rm_cols]
# result has retained more cells as compared to
# removing just rows (126) or just columns (37)
prod(dim(result))
# [1] 256
This approach should be possible with lpSolve
package as well but I think it involves building constraint matrix manually which is very cumbersome.