Addressing your original question about the loop over rows: up to a point, it is faster to process this by data frame columns rather than by rows. I've put your code into a function called func_Row, shown below
func_Row <- function(titanicDF) {
    # target variable
    y <- titanicDF$Survived
    lineHolders <- c()
    for (i in 1:nrow(titanicDF)) {
        # find the indexes of the nonzero values - anything
        # with a zero in that row needs to be ignored
        indexes <- which(as.logical(titanicDF[i, ]))
        indexes <- names(titanicDF[indexes])
        # nonzero values
        values <- titanicDF[i, indexes]
        valuePairs <- paste(indexes, values, sep = ":", collapse = " ")
        # add the label at the front and a newline at the end
        output_line <- paste0(y[i], " |f ", valuePairs, "\n")
        lineHolders <- c(lineHolders, output_line)
    }
    return(lineHolders)
}
and put together another function which processes by columns
func_Col <- function(titanicDF) {
    # start each output line with the label followed by the "|f" namespace marker
    lineHolders <- paste(titanicDF$Survived, "|f")
    for (ic in 1:ncol(titanicDF)) {
        # rows with a nonzero value in this column get a "name:value" pair appended
        nonzeroes <- which(as.logical(as.numeric(titanicDF[, ic])))
        lineHolders[nonzeroes] <- paste(lineHolders[nonzeroes], " ", names(titanicDF)[ic], ":",
                                        as.numeric(titanicDF[nonzeroes, ic]), sep = "")
    }
    lineHolders <- paste(lineHolders, "\n", sep = "")
    return(lineHolders)
}
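To see what the two functions produce, here is a minimal sketch on a small made-up data frame (assuming, as in the question, that titanicDF contains only numeric columns; the column names and values below are purely illustrative)
toyDF <- data.frame(Survived = c(1, 0),
                    Pclass   = c(0, 3),
                    Age      = c(29, 0))
func_Col(toyDF)
# [1] "1 |f Survived:1 Age:29\n" "0 |f Pclass:3\n"
# func_Row(toyDF) returns the same two lines; note that, as in the original
# code, every nonzero column (the label column included) becomes a feature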
Comparing these two functions with the microbenchmark package gives the following result
microbenchmark( func_Row(titanicDF), func_Col(titanicDF), times=10)
Unit: milliseconds
                expr        min         lq     median         uq       max neval
 func_Row(titanicDF) 370.396605 375.210624 377.044896 385.097586 443.14042    10
 func_Col(titanicDF)   6.626192   6.661266   6.675667   6.798711  10.31897    10
Notice that the results are in milliseconds for this set of data, and that processing by columns is more than 50 times faster than processing by rows (a median of roughly 6.7 ms versus 377 ms). It is fairly straightforward to address the memory issue, while retaining the benefit of column-wise processing, by reading the data in blocks of rows. I've created a 5,300,000-row file based on the Titanic data as follows
titanicDF_big <- titanicDF
for (i in 1:12) titanicDF_big <- rbind(titanicDF_big, titanicDF_big)
write.table(titanicDF_big, "titanic_big.txt", row.names = FALSE)
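Each pass through the loop doubles the data, so the written file holds nrow(titanicDF) * 2^12 data rows, which is where the figure of roughly 5.3 million comes from; a quick check
# 12 doublings of the roughly 1,300-row Titanic frame: about 1,300 * 4,096 rows
nrow(titanicDF_big)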
This file can then be read in blocks of rows using the following function
read_blocks <- function(file_name, row_max = 6000000L, row_block = 5000L) {
    # version of the code using func_Col to process the data by columns
    blockDF <- NULL
    for (row_num in seq(1, row_max, row_block)) {
        if (is.null(blockDF)) {
            # first block: read the header line and the first row_block data rows
            blockDF <- read.table(file_name, header = TRUE, nrows = row_block)
            lineHolders <- func_Col(blockDF)
        } else {
            # later blocks: skip the header plus the row_num - 1 rows already read
            blockDF <- read.table(file_name, header = FALSE, col.names = names(blockDF),
                                  nrows = row_block, skip = row_num)
            lineHolders <- c(lineHolders, func_Col(blockDF))
        }
    }
    return(lineHolders)
}
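Each element of the vector returned by read_blocks already ends in a newline, so writing the converted data out in vowpal wabbit format is just a cat() to a file; a minimal sketch (the output file name titanic_big.vw is only illustrative)
# convert the expanded file in 1,000,000-row blocks and write the result to disk
lines_vw <- read_blocks("titanic_big.txt", row_max = 6000000L, row_block = 1000000L)
cat(lines_vw, file = "titanic_big.vw", sep = "")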
Benchmark results for this version of read_blocks, which uses func_Col to process the data by columns, are given below for reading the entire expanded Titanic data file with block sizes ranging from 500,000 to 2,000,000 rows:
Unit: seconds
                                                                      expr      min       lq   median       uq      max neval
 read_blocks("titanic_big.txt", row_max = 6000000L, row_block = 2000000L) 39.43244 39.43244 39.43244 39.43244 39.43244     1
 read_blocks("titanic_big.txt", row_max = 6000000L, row_block = 1000000L) 46.66375 46.66375 46.66375 46.66375 46.66375     1
  read_blocks("titanic_big.txt", row_max = 6000000L, row_block = 500000L) 62.51387 62.51387 62.51387 62.51387 62.51387     1
Larger block sizes give noticeably better times but require more memory. However, these results show that with column-wise processing the entire 5.3 million row expanded Titanic data file can be read and converted in about a minute or less, even with a block size of only about 10% of the file. As before, the exact results will depend on the number of data columns and on system properties.
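If memory is the main constraint, a further variation (not benchmarked here) is to write each block's lines straight to the output file instead of accumulating them in lineHolders, so that only one block of rows is held in memory at a time. A minimal sketch along the lines of read_blocks, again using the illustrative output file name titanic_big.vw
read_blocks_to_file <- function(file_name, out_name,
                                row_max = 6000000L, row_block = 500000L) {
    # stream each block's vowpal wabbit lines to out_name so that only one
    # block of rows is ever held in memory
    blockDF <- NULL
    con <- file(out_name, open = "w")
    on.exit(close(con))
    for (row_num in seq(1, row_max, row_block)) {
        if (is.null(blockDF)) {
            blockDF <- read.table(file_name, header = TRUE, nrows = row_block)
        } else {
            blockDF <- read.table(file_name, header = FALSE, col.names = names(blockDF),
                                  nrows = row_block, skip = row_num)
        }
        cat(func_Col(blockDF), file = con, sep = "")
    }
    invisible(out_name)
}
The file connection is opened once and each block is appended as it is converted, so peak memory use is governed by row_block rather than by the size of the whole file.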