
I have example data below (the real data is 150x200) and need to keep the combination of rows and columns that gives the fewest NAs. I could use complete.cases(), but it removes too many rows.

Just by looking at this example, it is obvious to exclude row x6, as it has the highest NA count. Similarly, we can exclude columns A and F, as they have the most NAs.

I need a hint on the logic; it doesn't have to be a full code solution.

#reproducible data
df <- read.csv(text="
SampleID,A,B,C,D,E,F
x1,NA,x,NA,x,NA,x
x2,x,x,NA,x,x,NA
x3,NA,NA,x,x,x,NA
x4,x,x,x,NA,x,x
x5,x,x,x,x,x,x
x6,NA,NA,NA,x,NA,NA
x7,x,x,x,NA,x,x
x8,NA,NA,x,x,x,x
x9,x,x,x,x,x,NA
x10,x,x,x,x,x,x
x11,NA,x,x,x,x,NA")

# complete cases
df[ complete.cases(df),]
#   SampleID A B C D E F
#5        x5 x x x x x x
#10      x10 x x x x x x
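
For reference, the row- and column-wise NA counts behind the "exclude x6, A and F" observation can be checked quickly (just a diagnostic on the df above, not a solution):

# NA counts per sample (row) and per variable (column), SampleID excluded
rowSums(is.na(df[, -1]))   # the 6th row (x6) has the most NAs (5)
colSums(is.na(df[, -1]))   # columns A and F have the most NAs (5 each)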

Additional info: This is data for risk calculation; rows are samples and columns are variables. Each variable has a risk factor of some value. The risk-predicting algorithm (run in separate custom software) can work with, say, 5 variables or with 200; more variables obviously give a more reliable answer. To get comparable results, most samples should share most of the same variables. I will need to keep at least ~60% of the samples (rows).

zx8754
  • You could use `sum(is.na(...))`... but without understanding what you're trying to do and what your criteria are for "too many NAs", it is hard to come up with a reasonable answer. – Justin Dec 09 '13 at 14:40
  • Building on Justin's suggestion, I would recommend using row and column sums and defining a function that specifies your threshold for "too many". – A5C1D2H2I1M1N2O1R2T1 Dec 09 '13 at 14:47
  • Without further requirements, I don't think this is a solvable problem. E.g. what's the minimum subset you need, does it have to be contiguous, and most important: what are you trying to achieve? How do you justify removing some data but not others on the basis of `NA` values which have no apparent causal relationship? – Carl Witthoft Dec 09 '13 at 14:49
  • Agreed, the question is vague... the desired output would be: if you keep columns `x,y,z` and rows `5,8,9,20`, you will have the lowest NA count in proportion to all the data. – zx8754 Dec 09 '13 at 14:49
  • Still not helpful: you could remove **all** rows and columns with one or more `NA` to get the maximum possible ratio. Doubtful that will give you a valid analysis. What do your data mean in the first place? – Carl Witthoft Dec 09 '13 at 14:51
  • Additional info added, hope it helps. – zx8754 Dec 09 '13 at 14:58
  • Interesting - you may be looking for a variant of edge-connectedness, as in http://stackoverflow.com/questions/12630521/grouping-a-many-to-many-relationship-from-a-two-column-map (thanks to Blue Magister for identifying this link) – Carl Witthoft Dec 09 '13 at 15:41

2 Answers


Here's a function for you to play with. Hopefully it's something you can modify to suit your needs:

almostComplete <- function(dataset, rowPct, colPct = rowPct, n = 1) {
  if (sum(is.na(dataset)) == 0) out <- dataset  # nothing to drop
  else {
    # scaled NA counts per column and per row
    CS <- colSums(is.na(dataset))/ncol(dataset)
    RS <- rowSums(is.na(dataset))/nrow(dataset)
    # "automatic" mode: if a threshold is NULL, use the n-th highest value
    if (is.null(rowPct)) rowPct <- head(sort(RS, decreasing=TRUE), n)[n]
    if (is.null(colPct)) colPct <- head(sort(CS, decreasing=TRUE), n)[n]

    # drop every row/column at or above its threshold
    dropCols <- which(CS >= colPct)
    dropRows <- which(RS >= rowPct)
    out <- dataset[setdiff(sequence(nrow(dataset)), dropRows),
                   setdiff(sequence(ncol(dataset)), dropCols)]
  }
  out
}

A few examples...

Drop whichever row and column have the highest percentage of NA values:

almostComplete(df, NULL, NULL)
#    SampleID    B    C    D    E
# 1        x1    x <NA>    x <NA>
# 2        x2    x <NA>    x    x
# 3        x3 <NA>    x    x    x
# 4        x4    x    x <NA>    x
# 5        x5    x    x    x    x
# 7        x7    x    x <NA>    x
# 8        x8 <NA>    x    x    x
# 9        x9    x    x    x    x
# 10      x10    x    x    x    x
# 11      x11    x    x    x    x

Drop the rows and columns whose NA share is at least as high as the second-highest value:

almostComplete(df, NULL, NULL, n = 2)
#    SampleID    B    C    D E
# 2        x2    x <NA>    x x
# 4        x4    x    x <NA> x
# 5        x5    x    x    x x
# 7        x7    x    x <NA> x
# 8        x8 <NA>    x    x x
# 9        x9    x    x    x x
# 10      x10    x    x    x x
# 11      x11    x    x    x x

Set one threshold value for both rows and columns.

almostComplete(df, .7)
#    SampleID    B    C    D    E
# 1        x1    x <NA>    x <NA>
# 2        x2    x <NA>    x    x
# 3        x3 <NA>    x    x    x
# 4        x4    x    x <NA>    x
# 5        x5    x    x    x    x
# 6        x6 <NA> <NA>    x <NA>
# 7        x7    x    x <NA>    x
# 8        x8 <NA>    x    x    x
# 9        x9    x    x    x    x
# 10      x10    x    x    x    x
# 11      x11    x    x    x    x

Specify row and column threshold values separately.

almostComplete(df, .2, .5)
#    SampleID    B    C    D E
# 2        x2    x <NA>    x x
# 4        x4    x    x <NA> x
# 5        x5    x    x    x x
# 7        x7    x    x <NA> x
# 8        x8 <NA>    x    x x
# 9        x9    x    x    x x
# 10      x10    x    x    x x
# 11      x11    x    x    x x

Not knowing exactly how you're going to use this, I'm not sure if this is useful or not....
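
If the ~60% row-retention requirement from the question matters, a quick check after calling the function could look like this (just a usage sketch, reusing the n = 2 call from above):

out <- almostComplete(df, NULL, NULL, n = 2)
nrow(out) / nrow(df)   # fraction of samples (rows) kept; the target is >= ~0.6
sum(is.na(out))        # NA values remaining in the retained block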

A5C1D2H2I1M1N2O1R2T1
  • This is funny: while waiting I came up with almost the same solution, though of course mine is not as beautifully packed into a function as yours. Thank you, AnandoMahto. – zx8754 Dec 09 '13 at 16:09
  • @zx8754, cool. It was an interesting problem to think about, especially providing an "automatic" setting. – A5C1D2H2I1M1N2O1R2T1 Dec 09 '13 at 16:36

Maybe you could use apply with this function:

count.na <- function(vec) { return (length(which(is.na(vec)))) }

And finally choose the columns and rows based on their percentage of NAs.
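
For example, something along these lines (a sketch only, reusing count.na and the df from the question, with an arbitrary 40% cutoff):

# NA counts over the data part (SampleID excluded)
na_by_row <- apply(df[, -1], 1, count.na)   # NAs per sample
na_by_col <- apply(df[, -1], 2, count.na)   # NAs per variable

# keep rows/columns whose NA share stays under the chosen cutoff
cutoff <- 0.4
keep_rows <- na_by_row / ncol(df[, -1]) < cutoff
keep_cols <- na_by_col / nrow(df) < cutoff
df[keep_rows, c(TRUE, keep_cols)]           # TRUE keeps the SampleID column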

sandikou