
I'm looking for a way to store the numeric vectors of a data frame in a more compact way.

I use data from a household survey (PNAD in Brazil) with ~400k observations and ~200 questions. Once imported, the data uses ~500Mb of memory in R, but only 180Mb in Stata. This is because Stata has a 'compress' command that looks at the contents of each variable (vector) and coerces it to its most compact type. For instance, a double numeric variable (8 bytes) containing only integers ranging from 1 to 100 will be converted to a "byte" type (small int). Stata does something similar for string variables (vectors), reducing the string size of a variable to that of its largest element.
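For a sense of scale, here is roughly how the 8-byte vs. 4-byte difference plays out in R for one survey-sized variable (sizes are approximate):

x_dbl <- as.numeric(sample(1:100, 4e5, replace = TRUE))  # stored as double, 8 bytes per value
x_int <- as.integer(x_dbl)                               # same values, 4 bytes per value
object.size(x_dbl)   # ~3.2 MB
object.size(x_int)   # ~1.6 MB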

I know I could use the 'colClasses' argument in the read.table function and explicitly declare the types (as here). But that is time-consuming and sometimes the survey documentation will not be explicit about types beyond numeric vs. string. Also, I know 500Mb is not the end of the world these days, but appending surveys for different years starts getting big.
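Just to illustrate the explicit route (the file name and column names here are invented):

cls  <- c(age = "integer", income = "numeric", state = "character")
pnad <- read.csv("pnad_2013.csv", colClasses = cls)  # columns not named in 'cls' fall back to the default guessing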

I'm amazed I could not find something equivalent in R, which is also memory constrained (I know out-of-memory approaches are possible, but more complicated). How can there be a 3x memory gain lying around?

After reading a bit, my question boils down to:

1) Why is there no "byte" atomic vector type in R? It could be used to store small integers (from -127 to 100, as in Stata) and logicals (as discussed in this SO question). This would be very useful, as surveys normally contain many questions with small integer values (age, categorical questions, etc.). The other SO question mentions the 'bit' package for 1-bit logicals, but that is a bit too extreme because it loses the NA value (the base 'raw' type has the same limitation; see the sketch after these questions). Implementing a new atomic type and predicting the broader consequences is way above my league, though.

2) Is there an equivalent command to 'compress' in R? (here is a similar SO question).
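On point 1, the closest thing I see in base R is the 'raw' type, which does store one byte per element, but like the 'bit' package it cannot represent NA (NAs are coerced to 00 with a warning), so it does not really fit survey data. A quick sketch:

r_byte <- as.raw(sample(1:100, 4e5, replace = TRUE))  # 1 byte per element
object.size(r_byte)   # ~0.4 MB, vs ~1.6 MB for the same values as integer
as.raw(NA)            # 00, with a coercion warning -- the NA is lost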

If there is no such command, I wrote the code below, which coerces vectors that contain integers stored as "doubles" to integers. This should cut memory allocation in half for such vectors, without losing any information.

compress <- function(x){
  if(is.data.frame(x)){
    for(i in seq_along(x)){
      col <- x[[i]]
      # only touch double columns whose non-NA values are whole numbers
      # within integer range (NAs survive as.integer() unchanged)
      if(is.double(col) &&
         all(is.na(col) | (col == trunc(col) & abs(col) <= .Machine$integer.max))){
        x[[i]] <- as.integer(col)
      }
    }
  }
  return(x)
}

object.size(mtcars)             # output 6736 bytes
object.size(compress(mtcars))   # output 5968 bytes

Are there risks in this conversion? Help in making this code more efficient would also be appreciated.

  • How are you reading this data in? Normally `read.csv(...)` is pretty smart about converting numeric to integer when justified. Also, this: `new.df <- data.frame(lapply(df,function(col) {if(all(as.integer(col)==col))as.integer(col) else col}))` will be a lot faster than your loop. – jlhoward Dec 29 '14 at 22:10
  • @jlhoward: you are correct, the data is already being imported as integer by read.csv(). So the memory efficiency in Stata comes from storing the small int data, which is the content of most questions in the dataset, as a 1 byte vector, instead of 4 bytes in R. – LucasMation Dec 29 '14 at 23:45
  • I can't offer a concrete example, but the `adegenet` package uses only one byte to code some data. See more details [here](http://bioinformatics.oxfordjournals.org/content/early/2011/09/16/bioinformatics.btr521.full.pdf+html) and in the package code itself. – Roman Luštrik Dec 30 '14 at 08:41
