I am having difficulty understanding the piece of code below, which is meant to remove NAs from indata.

The inner apply call finds the NAs in each row of indata, but I don't understand what the outer apply call does. Why sum over columns?

na.rows = which( apply( apply( indata, 1, is.na ), 2, sum ) > 0 )   
if( length( na.rows ) > 0 )
    {
        indata = indata[ -na.rows, ]
        cat( "\n!!Removed NAs from data set!!\n" ); flush.console()
    }
Paul Hiemstra
user2806363
  • Please make your example reproducible; we do not have the `indata` object. See: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example . – Paul Hiemstra Jan 05 '14 at 14:40
  • Are you trying to remove **columns** containing `NA`s or **rows** containing `NA`s? The outer `apply` suggests columns, but `[-na.rows,]` inside the `if` statement suggests rows. – Paul Hiemstra Jan 05 '14 at 14:58

2 Answers

Create a small example and split the nested apply into two calls to see what happens:

## a small reproducible example
> set.seed(1)
> (indata <- matrix(sample(c(1,NA),9,rep=TRUE),ncol=3))
     [,1] [,2] [,3]
[1,]    1   NA   NA
[2,]    1    1   NA
[3,]   NA   NA   NA
## first apply
> (res <- apply( indata, 1, is.na ))
      [,1]  [,2] [,3]
[1,] FALSE FALSE TRUE
[2,]  TRUE FALSE TRUE
[3,]  TRUE  TRUE TRUE
## second apply
> apply(res, 2, sum ) 
[1] 2 1 3

Your code computes the number of missing values per row of indata. This can be written in a vectorized (and more efficient) manner:

rowSums(is.na(indata))
[1] 2 1 3
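
The square example above hides one detail: apply over rows returns each row's result as a column, so the intermediate matrix is transposed. The sketch below uses a hypothetical non-square matrix `m` (not part of the question) to make that visible and to check that the nested apply and rowSums agree.

```r
# Hypothetical 3 x 2 matrix, chosen non-square so the transposition is visible.
# Row 1 has one NA, row 2 has two, row 3 has none.
m <- matrix(c(1, NA, 3, NA, NA, 6), nrow = 3)

# apply() over rows stores each row's result as a COLUMN: the result is 2 x 3.
inner <- apply(m, 1, is.na)
dim(inner)

# Summing the columns of `inner` therefore counts the NAs per ROW of `m`.
per_row_nested <- apply(inner, 2, sum)

# Vectorized equivalent, working on the untransposed logical matrix.
per_row_vectorized <- rowSums(is.na(m))

all(per_row_nested == per_row_vectorized)
```

This is why the outer call in the question uses margin 2 even though the goal is a per-row count.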
agstudy
  • +1, `rowSums` is a nice vectorized solution. I find using `any` plus an `apply` loop a bit clearer. When reading this code, the reader must remember that using `sum` on a logical vector transforms `TRUE` to `1` and `FALSE` to `0`. – Paul Hiemstra Jan 05 '14 at 14:51
na.rows = which( apply( apply( indata, 1, is.na ), 2, sum ) > 0 )

It seems to me that the code above finds the rows in which any element is NA. The inner apply call finds the NAs and returns a logical matrix; because apply over rows returns each row's result as a column, this matrix is the transpose of indata's NA pattern. The outer apply call therefore sums over its columns, which gives the number of NAs per row of indata. The > 0 turns these counts into a logical vector, and which in turn converts that into the indices of the rows that contain at least one NA.
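
The last two steps can be tried in isolation. This is a minimal sketch using hypothetical per-row NA counts (`na_counts` is made up for illustration, not taken from the question):

```r
# Hypothetical per-row NA counts, as the nested apply() calls would produce.
na_counts <- c(2, 0, 1, 0)

# Comparison gives a logical vector: TRUE where the row has at least one NA.
has_na <- na_counts > 0

# which() converts the logical vector into the integer positions of its TRUEs.
na_rows <- which(has_na)

# sum() on a logical vector counts the TRUE values (TRUE is coerced to 1).
n_flagged <- sum(has_na)
```

Here `na_rows` holds the positions 1 and 3, and `n_flagged` is 2.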

Do note this code can be made quite a bit simpler:

na.rows = which(apply(is.na(indata), 1, any))

Just sticking to a boolean vector makes it even simpler. All your code above can be replaced by:

na.rows = apply(is.na(indata), 1, any)
indata = indata[!na.rows, ]
if(any(na.rows)) { cat( "\n!!Removed NAs from data set!!\n" ); flush.console() }

This eliminates the outer apply loop and the which call.
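
As a rough check of the logical-index approach, here is a sketch on made-up data. The matrices, the `drop = FALSE` argument, and the comparison with `complete.cases` (from the stats package) are additions for illustration and are not part of the original code.

```r
# Hypothetical data: rows 2 and 4 each contain an NA.
indata <- matrix(c(1, NA, 3, 4,
                   5, 6, 7, NA), ncol = 2)

# Logical flag per row: TRUE if the row contains any NA.
na.rows <- apply(is.na(indata), 1, any)

# drop = FALSE keeps the result a matrix even if only one row survives.
cleaned <- indata[!na.rows, , drop = FALSE]

# complete.cases() flags the rows without NAs, so it selects the same rows.
same_as_complete_cases <- identical(
  cleaned, indata[complete.cases(indata), , drop = FALSE])

# With no NAs at all, the logical index keeps every row ...
clean_input <- matrix(1:4, ncol = 2)
no_na <- apply(is.na(clean_input), 1, any)
kept_logical <- nrow(clean_input[!no_na, , drop = FALSE])

# ... whereas a negative index built from an empty which() selects zero rows,
# which is why the which()-based version needs its length() guard.
kept_negative <- nrow(clean_input[-which(no_na), , drop = FALSE])
```

With the logical index, no guard is needed before subsetting; the `if` only controls whether the message is printed.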

Paul Hiemstra
  • Hmm, the original question is a bit vague. The `[-na.rows,]` suggests rows, but the outer `apply` loop detects `NA`s in the columns. I think what the OP wants is `NA`s per row, so I'll change the `2` to `1` in the `apply` loops. – Paul Hiemstra Jan 05 '14 at 14:57
  • No, I think you were right, but the OP needs to clarify what they need. – Paul Hiemstra Jan 05 '14 at 14:59