Why does lapply produce random NA values in this script?

Question

When I run this script on just one file in a folder:

emboss<-read.table("emboss_012.ss",header=T)
x<-table(emboss[,2],emboss[,3])/NROW(emboss[,3])
y<-as.vector(t(x))
nms <- expand.grid(colnames(x), rownames(x))
names(y) <- paste( nms[,2],nms[,1],sep="")
write.table(t(y), file = "test3.csv",append=TRUE)

I get the desired result.

However, doing this in one go for all files in the folder results in NAs appearing at seemingly random positions. I am doing this with:

runForAll <- function(x) {
  emboss <- read.table(x,header=T)
  x <- table(emboss[,2],emboss[,3])/NROW(emboss[,3])
  y <- as.vector(t(x))
  nms <- expand.grid(colnames(x), rownames(x))
  names(y) <- paste( nms[,2],nms[,1],sep="")
  return(t(y))
}

my.files <- list.files(pattern = "emboss_\\d+\\.ss")
outputs <- lapply(my.files, FUN = runForAll)   

library(plyr)
one.header.output <- rbind.fill.matrix(outputs)
write.table(one.header.output, file = "nontpsec.csv")

My files are located here:

https://drive.google.com/folderview?id=0B0iDswLYaZ0zWjQ4RjdnMEUzUW8&usp=sharing

This is very weird and I can't see why it is happening, especially as all the other data is correct, even when looping through all files in one go.

interesting that calling as.matrix(outputs) shows the length decrease for the offending files. even though when you eyeball each output it shows no missing data. — brucezepplin, Jun 12 '13 at 15:17
In my experience, I achieve best results when I try to construct a minimal, self contained example. The problem becomes evident in 95% of cases (for me). Consider posting such an example (see http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) in the future. — Roman Luštrik, Jun 12 '13 at 16:04

score 2 · Accepted Answer · edited May 23 '17 at 11:50

Your data tables have different numbers of rows: for example, the first one has 20 rows and the last one only 19. This is where the problem comes from.

Here's a little test:

tmp <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")

which(rownames(x) %in% tmp)

In the case of files 12 and 13 the second row is missing (label B).
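
To list the missing labels per table rather than the matching positions, setdiff() against the expected label set is one option. This is a minimal sketch on made-up toy tables; the names expected, tabs and missing_labels are hypothetical, and in the real script each table would come from one emboss_*.ss file:

```r
# Hypothetical expected label set (the real one would be the full list, like tmp above).
expected <- c("A", "B", "C")

# Toy stand-ins for the per-file tables; the second one has no "B" row.
tabs <- list(
  file_full  = table(c("A", "B", "C"), c("x", "y", "x")),
  file_short = table(c("A", "C", "C"), c("x", "x", "y"))
)

# For each table, report which expected row labels are absent.
missing_labels <- lapply(tabs, function(tb) setdiff(expected, rownames(tb)))
print(missing_labels)
```

Any table with a non-empty entry here will end up with fewer named columns than the others after flattening.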

Have a look at this post:

Compare two data.frames to find the rows in data.frame 1 that are not present in data.frame 2

This might work for you:

Fastest way to add rows for missing values in a data.frame?
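
Another way to avoid the NAs, as far as I can tell, is to give table() factors with a fixed set of levels, so that a label absent from a file shows up as a zero instead of a missing column. rbind.fill.matrix matches columns by name and fills any column a matrix lacks with NA, which is why a file with a missing row produces NAs in the combined output. The sketch below uses toy data frames in place of the files; tabulate_fixed, row_levels and col_levels are hypothetical names, and the level sets are assumptions you would replace with the labels your files actually use:

```r
# Sketch: same steps as runForAll, but both columns are tabulated against fixed levels.
tabulate_fixed <- function(df, row_levels, col_levels) {
  counts <- table(factor(df[[2]], levels = row_levels),
                  factor(df[[3]], levels = col_levels))
  props <- counts / nrow(df)
  y <- as.vector(t(props))
  nms <- expand.grid(col_levels, row_levels)
  names(y) <- paste0(nms[, 2], nms[, 1])
  t(y)  # 1-row matrix with one named column per (row label, column label) pair
}

# Assumed level sets for the toy data.
row_levels <- c("A", "B", "C")
col_levels <- c("x", "y")

# Toy stand-ins for two files; the second has no "B" rows at all.
full    <- data.frame(id = 1:4, res = c("A", "B", "A", "C"), ss = c("x", "y", "y", "x"))
partial <- data.frame(id = 1:3, res = c("A", "C", "C"), ss = c("x", "x", "y"))

out <- rbind(tabulate_fixed(full, row_levels, col_levels),
             tabulate_fixed(partial, row_levels, col_levels))
print(out)
```

Because every result now has the same column names in the same order, a plain rbind (or rbind.fill.matrix) has nothing left to fill in.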
