
I have a 37x21 matrix in R which contains many NAs. For my analysis, I need to get rid of all the NAs. I could remove all rows containing an NA, all columns containing an NA, or some combination of the two.

I want to remove specific rows and columns in such a way that I remove all NAs but retain the highest number of data cells possible.

E.g. removing all ROWS with an NA results in a 10x21 matrix (10*21 = 210 cells of data), while removing all COLUMNS with an NA results in a 37x12 matrix (37*12 = 444 cells of data). But instead of doing either of these extremes, I want to remove the combination of rows and columns that results in the highest number of cells of data being retained. How would I go about this?
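To illustrate on a toy 3x3 matrix (not my actual data) how a mixed removal can beat either extreme:

```r
# Toy example: count cells retained when dropping every row vs. every
# column that contains an NA, vs. a mixed row/column removal.
m <- matrix(c(1, NA, 3,
              4,  5, 6,
             NA,  8, 9), nrow = 3, byrow = TRUE)
rows_ok <- apply(m, 1, function(x) !anyNA(x))
cols_ok <- apply(m, 2, function(x) !anyNA(x))
prod(dim(m[rows_ok, , drop = FALSE]))  # drop rows 1 and 3: 1*3 = 3 cells
prod(dim(m[, cols_ok, drop = FALSE]))  # drop cols 1 and 2: 3*1 = 3 cells
# Dropping row 1 and column 1 instead leaves a 2x2 block with no NAs,
# i.e. 4 cells -- the combination retains more data than either extreme.
anyNA(m[-1, -1])  # FALSE
```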

Shree
bob
  • As a starter see [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) ;) – markus Jul 06 '19 at 19:27
  • Related: https://stackoverflow.com/q/53613882/6574038 – jay.sf Jul 06 '19 at 19:32
  • @jay.sf I think this example is a bit more complicated than what I'm asking. I just want to remove specific rows/columns such that all NAs are removed while retaining the maximum possible cells in the matrix. – bob Jul 06 '19 at 19:35
  • Do you have a proposed algorithm? I imagine one way would be to remove rows/columns containing NA starting with those with the fewest values. It would help if you provided your data – Calum You Jul 06 '19 at 20:52
  • Your problem has no "constraints". Without them, not removing any rows or columns leave you with highest number of non NA cells OR you could only remove rows and columns full of `NA`. – Shree Jul 06 '19 at 20:53
  • 1
    @Shree the constraint is that you must remove all NAs, but for any given NA cell should you remove the row or the column containing it? I think the problem is defined but needs some example data – Calum You Jul 06 '19 at 20:56
  • @CalumYou Oh, got it. That's interesting. Thanks! – Shree Jul 06 '19 at 20:56

2 Answers


Here is one way using the first algorithm that I could think of. The approach is to iteratively remove a row or column that contains at least one NA and has the fewest non-NA values in the matrix (so that removing it loses the fewest cells of data). To do this, I build a dataframe of the rows and columns with their counts of NA and non-NA values, along with dimension and index. At the moment, ties are resolved by deleting rows before columns and earlier indexes before later ones.

I am not sure that this will give the global maximum (e.g. it only takes one branch at ties), but it should do better than just deleting rows or just columns. In this example we get 210 by deleting rows and 74 by deleting columns, but 272 with the new approach. The code could also probably be optimised if you need to use this for much larger matrices or for matrices with many more NAs.

set.seed(1)
mat <- matrix(sample(x = c(1:10, NA), size = 37 * 21, replace = TRUE), ncol = 21)
# filter rows
prod(dim(mat[apply(mat, 1, function(x) all(!is.na(x))), ]))
#> [1] 210
# filter cols
prod(dim(mat[, apply(mat, 2, function(x) all(!is.na(x)))]))
#> [1] 74

delete_row_col <- function(m) {
  to_delete <- rbind(
    data.frame(
      dim = "row",
      index = seq_len(nrow(m)),
      nas = rowSums(is.na(m)),
      non_nas = rowSums(!is.na(m)),
      stringsAsFactors = FALSE
    ),
    data.frame(
      dim = "col",
      index = seq_len(ncol(m)),
      nas = colSums(is.na(m)),
      non_nas = colSums(!is.na(m)),
      stringsAsFactors = FALSE
    )
  )
  to_delete <- to_delete[to_delete$nas > 0, ]
  to_delete <- to_delete[to_delete$non_nas == min(to_delete$non_nas), ]

  if (nrow(to_delete) == 0) {
    return(m)
  } else if (to_delete$dim[1] == "row") {
    m <- m[-to_delete$index[1], ]
  } else {
    m <- m[, -to_delete$index[1]]
  }
  return(m)
}

remove_matrix_na <- function(m) {
  while (any(is.na(m))) {
    m <- delete_row_col(m)
  }
  return(m)
}

prod(dim(remove_matrix_na(mat)))
#> [1] 272

Created on 2019-07-06 by the reprex package (v0.3.0)

Calum You

Here's a way using mixed integer programming (MIP). I have used the ompr package for mathematical modeling and the open-source "glpk" solver. I have added the model explanation as comments in the code. MIP approaches, when successful, guarantee an optimal solution, as indicated by solver_status(model) shown in the code.

This approach will easily scale up to handle large matrices.

library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)

set.seed(1)
mat <- matrix(sample(x = c(1:10, NA), size = 37 * 21, replace = TRUE), ncol = 21)
# filtering all rows with NA retains 126 cells
prod(dim(mat[apply(mat, 1, function(x) all(!is.na(x))), , drop = F]))
# [1] 126
# filtering all cols with NA retains 37 cells
prod(dim(mat[, apply(mat, 2, function(x) all(!is.na(x))), drop = F]))
# [1] 37

m <- +!is.na(mat) # gets logical matrix; 0 if NA else 1    
nr <- nrow(m)
nc <- ncol(m)

model <- MIPModel() %>% 
  # keep[i,j] is 1 if matrix cell [i,j] is to be kept else 0
  add_variable(keep[i,j], i = 1:nr, j = 1:nc, type = "binary") %>% 
  # rm_row[i] is 1 if row i is selected for removal else 0
  add_variable(rm_row[i], i = 1:nr, type = "binary") %>% 
  # rm_col[j] is 1 if column j is selected for removal else 0
  add_variable(rm_col[j], j = 1:nc, type = "binary") %>% 
  # maximize good cells kept
  set_objective(sum_expr(keep[i,j], i = 1:nr, j = 1:nc), "max") %>% 
  # a cell can be kept only when its row is not selected for removal
  add_constraint(keep[i,j] <= 1 - rm_row[i], i = 1:nr, j = 1:nc) %>%
  # a cell can be kept only when its column is not selected for removal
  add_constraint(keep[i,j] <= 1 - rm_col[j], i = 1:nr, j = 1:nc) %>%
  # only non-NA values can be kept
  add_constraint(m[i,j] + rm_row[i] + rm_col[j] >= 1, i = 1:nr, j = 1:nc) %>% 
  # solve using free glpk solver
  solve_model(with_ROI(solver = "glpk"))

Get solution -

solver_status(model)
# [1] "optimal"    <- "optimal" guarantees optimality

# get rows to remove
rm_rows <- model %>%
  get_solution(rm_row[i]) %>% 
  filter(value > 0) %>% 
  pull(i)

# [1]  1  3  4  6  7  8 10 14 18 19 20 21 22 23 24 28 30 33 34 35 37

# get columns to remove
rm_cols <- model %>%
  get_solution(rm_col[j]) %>% 
  filter(value > 0) %>% 
  pull(j)

# [1]  6 14 15 16 17

result <- mat[-rm_rows, -rm_cols]

# result has retained more cells as compared to
# removing just rows (126) or just columns (37)
prod(dim(result))
# [1] 256

This approach should be possible with the lpSolve package as well, but I think it involves building the constraint matrix manually, which is very cumbersome.

Shree