find and delete rows which have greater than 5% missing data

Question

I have a matrix (called results) that looks like this

     id1 id2 id3 id4 id5 id6 id7 id8 id9
snp1  1   2   0   NA  1   1   1   2   1
snp2  2   2   2   2   0   2   NA  NA  0
snp3  NA  NA  1   NA  0   NA  NA  2   2

So far, I have deleted rows and columns that were completely filled with NAs using

indexsnp=apply(results,1,
function(x) length(which(is.na(x)==T)))
indexsnp=which(indexsnp==length(results[1,]))
indexsample=apply(results,2,
function(x) length(which(is.na(x)==T)))
indexsample=which(indexsample==length(results[,1]))

#get rid of indexes
results=results[-indexsnp,]
results=results[,-indexsample]

I still have a lot of NAs in my dataset, so now I would like to see which snp have call rates below 95% (i.e. which rows consist of more than 5% NAs), and then delete those rows. I'm not sure how to do this. I have tried

snpsum.col <- col.summary(results)
library(snpStats)
call <- 0.95
use <- with(snpsum.col, (!is.na(Call.rate) & Call.rate >= call))
use[is.na(use)] <- FALSE              
cat(ncol(results)-sum(use),"SNPs will be removed due to low call  
rate.\n")
genotype <- genotype[,use]
snpsum.col <- snpsum.col[use,]

but I get the error

Error in col.summary(results) : not a SnpMatrix object

Is there another way I can do this?

Isn't this just `results[rowSums(is.na(results)) < (ncol(results) * .05), ]` ? Your code for removing rows completely filled with `NA` is also very inefficient and unnecessary. Just use `na.omit` or `complete.cases` — David Arenburg, May 01 '16 at 09:40
@DavidArenburg is there a way to see which rows (or just how many) have been deleted if I do it that way? — E_Schyler, May 01 '16 at 09:56
`which(rowSums(is.na(results)) > (ncol(results) * .05))`. Or wrap it up into `sum` if you want to know "how many". And see [this](http://stackoverflow.com/questions/4862178/remove-rows-with-nas-in-data-frame) for the first part of the question — David Arenburg, May 01 '16 at 10:00
Thank you @DavidArenburg . So just to clarify, and I know this is extremely stupid but I am very new to R so please bear with me I just want to make sure I'm understanding. If I do newresults = results[rowSums(is.na(results)) < (ncol(results) * .05), ] then newresults will the the same matrix, but only with rows that have less than 5% NAs then if I do sum(rowSums(is.na(results)) > (ncol(results) * .05)) the answer I get back is how many rows I just removed? — E_Schyler, May 01 '16 at 10:14
Yes. You can just try out the code and see for yourself I guess. — David Arenburg, May 01 '16 at 10:17
Btw, if you removing the rows, you can skip the first part too. Because if you removing all the rows with at least 5% of missing data, then the rows that have all missing data will obviously removed too. So you can do the whole thing in one line instead of that whole process you've written above. — David Arenburg, May 01 '16 at 10:21

score 1 · Accepted Answer · answered May 01 '16 at 18:21

1

If m is such a matrix, do

m <- m[is.na(m)%*%rep(1,ncol(m))<=ncol(m)*0.05,]

answered May 01 '16 at 18:21

user31264

6,557
3
26
40

find and delete rows which have greater than 5% missing data

1 Answers1