
I am trying to build a fast CSV to Vowpal Wabbit input-format translator. I found some good code related to libsvm and based my example on that. It works great on the small Titanic data set provided, but my real data set has over 4.5 million observations with 200+ features, and the code below would take three days to run on a powerful server.

Is there a way to remove the single loop here? Keep in mind that Vowpal Wabbit has its own sparseness, so the code needs to check the indexes on every line to exclude zeros and NAs (unlike a data frame, Vowpal doesn't need to hold the same number of features in each line). I'm OK with writing every line to a file instead of holding it all in memory. Any solutions would be much appreciated!
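
For context, each line in Vowpal Wabbit's input format has the shape `label |namespace feature:value feature:value ...`, and zero or missing features can simply be left off that line. The two lines below use made-up values, purely to illustrate the shape of the output the translator needs to produce:

1 |f PClass:3 Age:22 Sex:1
0 |f Age:40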

# sample data set
titanicDF <- read.csv('http://math.ucdenver.edu/RTutorial/titanic.txt', sep='\t')
titanicDF <- titanicDF[c("PClass", "Age", "Sex", "Survived")]

# target variable
y <- titanicDF$Survived

lineHolders <- c()
for (i in 1:nrow(titanicDF)) {

    # find indexes of nonzero values - anything
    # with zero for that row needs to be ignored!
    indexes <- which(as.logical(titanicDF[i, ]))
    indexes <- names(titanicDF)[indexes]

    # nonzero values
    values <- titanicDF[i, indexes]

    valuePairs <- paste(indexes, values, sep = ":", collapse = " ")

    # add label in the front and newline at the end
    output_line <- paste0(y[i], " |f ", valuePairs, "\n")

    lineHolders <- c(lineHolders, output_line)
}
amunategui
  • At the very least pre-allocate `lineHolders`. Growing objects in R is about the unwisest thing you could possibly do. Other ideas: prepend the labels up front in a vectorized fashion, along with all the other pieces that will be pasted together. Then re-organize everything in a list of lists and use `lapply` (a rough sketch of this idea appears after these comments). – joran Aug 21 '14 at 20:53
  • R would not be my first choice to reshape a text file like this if speed was important. Python or Perl might make quicker work of it. But have you profiled this code? Removing a loop isn't guaranteed to make anything faster. The next best thing would be to split the file up and transform chunks in parallel. Then you can cut down your time depending on the number of processors you have access to. – MrFlick Aug 21 '14 at 20:58
  • Thanks guys - the original libsvm script writes out each line to file, avoids memory issues. I may have to experiment with distributed tasks if I can find a good way around concurrent disk writes. Thanks for the input! – amunategui Aug 22 '14 at 00:45
  • Might be easier to write the needed columns to a new .csv and use phraug, which is the tool the VW guys themselves have written for the conversion - https://github.com/zygmuntz/phraug – LauriK Nov 03 '14 at 08:52
  • I won't copy/paste, but I made a [solution using `fwrite`](http://stackoverflow.com/a/41215573/3576984) and operating column-wise here... it's still pretty slow on examples I've seen so far... – MichaelChirico Dec 23 '16 at 23:30
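
A rough sketch of the pre-allocation / vectorization idea from joran's comment, assuming the same titanicDF and y as in the question; the name labelPrefix and the use of vapply (a type-checked cousin of lapply) are illustrative choices, not anything from the original post:

# build all label prefixes up front in one vectorized call, then fill the
# result with a single vapply instead of growing lineHolders inside a loop
labelPrefix <- paste0(y, " |f ")
featNames   <- names(titanicDF)

lineHolders <- vapply(seq_len(nrow(titanicDF)), function(i) {
    keep <- which(as.logical(titanicDF[i, ]))   # drop zero/NA features for this row
    paste0(labelPrefix[i],
           paste(featNames[keep], titanicDF[i, keep], sep = ":", collapse = " "),
           "\n")
}, character(1))

This still visits every row, so it mainly removes the cost of growing lineHolders; the column-wise approach in the answer below goes further.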

1 Answer

Addressing your original question about the loop over rows: it appears that, up to a point, it is faster to process this by data frame columns rather than by rows. I've put your code into a function called func_Row, shown below:

func_Row <- function(titanicDF) {
    # target variable
    y <- titanicDF$Survived
    lineHolders <- c()
    for (i in 1:nrow(titanicDF)) {
        # find indexes of nonzero values - anything
        # with zero for that row needs to be ignored!
        indexes <- which(as.logical(titanicDF[i, ]))
        indexes <- names(titanicDF)[indexes]
        # nonzero values
        values <- titanicDF[i, indexes]
        valuePairs <- paste(indexes, values, sep = ":", collapse = " ")
        # add label in the front and newline at the end
        output_line <- paste0(y[i], " |f ", valuePairs, "\n")
        lineHolders <- c(lineHolders, output_line)
    }
    return(lineHolders)
}

and put together another function which processes the data frame by columns:

func_Col <- function(titanicDF) {
    lineHolders <- paste(titanicDF$Survived, "|f")
    for (ic in 1:ncol(titanicDF)) {
        nonzeroes <- which(as.logical(as.numeric(titanicDF[, ic])))
        lineHolders[nonzeroes] <- paste(lineHolders[nonzeroes], " ",
                                        names(titanicDF)[ic], ":",
                                        as.numeric(titanicDF[nonzeroes, ic]), sep = "")
    }
    lineHolders <- paste(lineHolders, "\n", sep = "")
    return(lineHolders)
}

Comparing the two functions with the microbenchmark package gives the following results:

library(microbenchmark)
microbenchmark(func_Row(titanicDF), func_Col(titanicDF), times = 10)
Unit: milliseconds
               expr        min         lq     median         uq       max neval
func_Row(titanicDF) 370.396605 375.210624 377.044896 385.097586 443.14042    10
func_Col(titanicDF)   6.626192   6.661266   6.675667   6.798711  10.31897    10

Note that the times are in milliseconds for this data set, so processing by columns is roughly 50 times faster than processing by rows. It's fairly straightforward to address the memory issue, and retain the benefit of processing by columns, by reading the data in blocks of rows. I've created a 5,300,000-row file based on the Titanic data as follows:

titanicDF_big <- titanicDF
for( i in 1:12 )  titanicDF_big <- rbind(titanicDF_big, titanicDF_big)
write.table(titanicDF_big, "titanic_big.txt", row.names=FALSE )

This file can then be read in blocks of rows using the following function

read_blocks <- function(file_name, row_max = 6000000L, row_block = 5000L) {
    # version of the code using func_Col to process the data by columns
    blockDF <- NULL
    for (row_num in seq(1, row_max, row_block)) {
        if (is.null(blockDF)) {
            blockDF <- read.table(file_name, header = TRUE, nrows = row_block)
            lineHolders <- func_Col(blockDF)
        } else {
            # skip the header line plus the row_num - 1 data rows already read
            blockDF <- read.table(file_name, header = FALSE, col.names = names(blockDF),
                                  nrows = row_block, skip = row_num)
            lineHolders <- c(lineHolders, func_Col(blockDF))
        }
    }
    return(lineHolders)
}

Benchmark results for this version of read_blocks, which uses func_Col to process the data by columns, are given below for reading the entire expanded Titanic data file with block sizes ranging from 500,000 to 2,000,000 rows:

Unit: seconds
                                                                     expr      min       lq   median       uq      max neval
read_blocks("titanic_big.txt", row_max = 6000000L, row_block = 2000000L) 39.43244 39.43244 39.43244 39.43244 39.43244     1
read_blocks("titanic_big.txt", row_max = 6000000L, row_block = 1000000L) 46.66375 46.66375 46.66375 46.66375 46.66375     1
 read_blocks("titanic_big.txt", row_max = 6000000L, row_block = 500000L) 62.51387 62.51387 62.51387 62.51387 62.51387     1

Larger block sizes give noticeably better times but require more memory. However, these results show that by processing the data by columns, the entire 5.3 million row expanded Titanic data file can be read and converted in about a minute or less, even with a block size equal to about 10% of the file. Again, the results will depend on the number of data columns and on system properties.
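
Since the question mentions being OK with writing every line to a file instead of holding everything in memory, here is a minimal sketch of how the same block loop could append each converted block to an output file as it goes. The function name write_blocks and the file names are illustrative, and the connection handling is just one reasonable choice:

# sketch: convert each block with func_Col and append it to an output file
# instead of accumulating all lines in memory (names are illustrative)
write_blocks <- function(file_name, out_name, row_max = 6000000L, row_block = 500000L) {
    out_con <- file(out_name, open = "w")
    on.exit(close(out_con))
    blockDF <- NULL
    for (row_num in seq(1, row_max, row_block)) {
        if (is.null(blockDF)) {
            blockDF <- read.table(file_name, header = TRUE, nrows = row_block)
        } else {
            # skip the header line plus the rows already read
            blockDF <- read.table(file_name, header = FALSE, col.names = names(blockDF),
                                  nrows = row_block, skip = row_num)
        }
        # func_Col already ends each line with "\n", so no extra separator is needed
        cat(func_Col(blockDF), file = out_con, sep = "")
    }
    invisible(out_name)
}

# usage, e.g.:
# write_blocks("titanic_big.txt", "titanic_big.vw")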

WaltS