As the title says, I want to use lapply instead of a for loop to extract rows from a data frame and collect them in a new data frame. My motivation is that the data frame I'm parsing contains thousands of genes, and I've read that the apply functions are faster than for loops at iterating over large tables.
### My data table ###
rawCounts <- data.frame(
  ensembl_gene_id_version = c('ENSG00000000003.15', 'ENSG00000000005.6', 'ENSG00000000419.14'),
  HS1 = c(1133, 0, 1392),
  HS2 = c(900, 0, 1155),
  HS3 = c(1251, 0, 2011),
  HS4 = c(785, 0, 1022),
  stringsAsFactors = FALSE
)
## Function
extract_counts <- function(df, esdbid) {
  counts <- data.frame()
  plyr::ldply(esdbid, function(i) {
    counts <- df[grep(pattern = i, x = df), ] %>% rbind()
  })
  return(counts)
}
## Call the first one
extract_counts(df = rawCounts, esdbid = c('ENSG00000000003.15'))
I want this to return a data frame, so I used plyr::ldply, following this post: Extracting outputs from lapply to a dataframe.
However, the function returns an empty data frame. Eventually I want to scale up my esdbid vector to include multiple values, i.e. any combination of gene IDs, so that I can quickly retrieve their counts.
Strangely, when I run the body of the function directly in the console, it appears to work as intended for a vector of length 1, for example:
esdbid <- 'ENSG00000000003.15'
plyr::ldply(esdbid, function(i) {counts <- rawCounts[grep(pattern = i, x = rawCounts),] %>% rbind()})
This returns a data frame with the correct values. However, when I increase the length of the vector, I get the same first row back once per ID instead of one row per gene. For example, if esdbid <- c('ENSG00000000003.15', 'ENSG00000000005.6', 'ENSG00000000419.14')
then the console code returns the values for ENSG00000000003.15 three times.
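To make the symptom and the intended result concrete, here is a minimal sketch, not a definitive fix. It assumes exact matching on the ensembl_gene_id_version column is what is wanted; the id_col argument is a hypothetical addition that is not in the original function. The first part shows what grep() appears to be doing when it is handed the whole data frame; the second part is one possible way to get one row per ID with lapply.

```r
# Same example data as above.
rawCounts <- data.frame(
  ensembl_gene_id_version = c("ENSG00000000003.15", "ENSG00000000005.6", "ENSG00000000419.14"),
  HS1 = c(1133, 0, 1392),
  HS2 = c(900, 0, 1155),
  HS3 = c(1251, 0, 2011),
  HS4 = c(785, 0, 1022),
  stringsAsFactors = FALSE
)

# Diagnostic: grep() coerces a data frame with as.character(), which yields
# one deparsed string per column. The match is therefore a column index
# (the ID column), not the row of the matching gene.
col_hit <- grep(pattern = "ENSG00000000419.14", x = rawCounts, fixed = TRUE)

# Sketch of the intended behaviour. `id_col` is a hypothetical parameter
# naming the column that holds the gene IDs.
extract_counts <- function(df, esdbid, id_col = "ensembl_gene_id_version") {
  # One single-row (or zero-row) data frame per requested ID, matched exactly.
  hits <- lapply(esdbid, function(id) df[df[[id_col]] == id, , drop = FALSE])
  # Bind the pieces together; the value of the last expression is returned,
  # so nothing is lost in a local variable.
  do.call(rbind, hits)
}

ids <- c("ENSG00000000419.14", "ENSG00000000003.15")
res <- extract_counts(df = rawCounts, esdbid = ids)
res
```

If the per-ID loop is not actually needed, a vectorised subset such as rawCounts[rawCounts$ensembl_gene_id_version %in% ids, ] selects the same rows (in table order rather than in the order of ids) without lapply or plyr.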