36

I am writing R code to create a square matrix. So my approach is:

  1. Allocate a matrix of the correct size
  2. Loop through each element of my matrix and fill it with an appropriate value

My question is really simple: what is the best way to pre-allocate this matrix? Thus far, I have two ways:

> x <- matrix(data=NA,nrow=3,ncol=3)
> x
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
[3,]   NA   NA   NA

or

> x <- list()
> length(x) <- 3^2
> dim(x) <- c(3,3)
> x
     [,1] [,2] [,3]
[1,] NULL NULL NULL
[2,] NULL NULL NULL
[3,] NULL NULL NULL

As far as I can see, the former is a more concise method than the latter. Also, the former fills the matrix with NAs, whereas the latter is filled with NULLs.
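To make that difference concrete, here is a minimal sketch of what each approach actually produces. The names `a` and `b` are only for illustration and are not from the question.

```r
# Approach 1: an atomic matrix whose cells are logical NA
a <- matrix(data = NA, nrow = 3, ncol = 3)

# Approach 2: a list that has been given a dim attribute (a "list matrix")
b <- list()
length(b) <- 3^2
dim(b) <- c(3, 3)

typeof(a)           # "logical"
typeof(b)           # "list"
is.matrix(b)        # TRUE: it has a dim attribute of length 2
is.null(b[[1, 1]])  # TRUE: every cell holds NULL until something is assigned
```

So the second form is not a numeric matrix at all; its cells are list elements, which matters if the values are later passed to functions that expect an atomic matrix.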

Which is the "better" way to do this? In this case, I'm defining "better" as "better performance", because this is statistical computing and this operation will be taking place with large datasets.

While the former is more concise, it isn't breathtakingly easier to understand, so I feel like this could go either way.

Also, what is the difference between NA and NULL in R? ?NA and ?NULL tell me that "NA" has a length of "1" whereas NULL has a length of "0" - but is there more here? Or a best practice? This will affect which method I use to create my matrix.

poundifdef
  • 1
    Not asked is, why do you want to *loop* over the elements of your matrix? Is it possible you can use a vectorized operation instead? That should be your next question here! :) – Harlan Nov 17 '09 at 14:17
  • @Harlan that is basically what I am getting at in this question here: http://stackoverflow.com/questions/1719447/outer-equivalent-for-non-vector-lists-in-r. If you have a suggestion, I'd love to hear it! – poundifdef Nov 17 '09 at 15:03
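As a rough illustration of the vectorized alternative raised in the comments, the sketch below fills the same matrix with a loop and with `outer()`. The cell formula `i + 10 * j` is a made-up placeholder, not the asker's actual computation.

```r
n <- 3

# Loop version: preallocate, then fill cell by cell
looped <- matrix(NA_real_, nrow = n, ncol = n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    looped[i, j] <- i + 10 * j   # hypothetical per-cell value
  }
}

# Vectorized version: outer() calls the function once on whole index vectors
vectorized <- outer(seq_len(n), seq_len(n), function(i, j) i + 10 * j)

all.equal(looped, vectorized)
```

This only works directly when the per-cell function is itself vectorized over `i` and `j`.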

3 Answers

47

When in doubt, test it yourself. The first approach is both easier and faster.

> create.matrix <- function(size) {
+ x <- matrix()
+ length(x) <- size^2
+ dim(x) <- c(size,size)
+ x
+ }
> 
> system.time(x <- matrix(data=NA,nrow=10000,ncol=10000))
   user  system elapsed 
   4.59    0.23    4.84 
> system.time(y <- create.matrix(size=10000))
   user  system elapsed 
   0.59    0.97   15.81 
> identical(x,y)
[1] TRUE
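If you want to repeat the comparison on a smaller scale, a sketch along these lines puts both approaches inside a function so each pays the same call overhead. `direct.matrix` is a hypothetical wrapper added here for illustration, and the size and repeat count are arbitrary; timings will vary by machine, so none are shown.

```r
# The helper from above, with comments on what each step does
create.matrix <- function(size) {
  x <- matrix()            # a 1 x 1 logical NA matrix
  length(x) <- size^2      # pads with NA; the dim attribute is dropped here
  dim(x) <- c(size, size)  # restore a square shape
  x
}

# Hypothetical wrapper so the direct call is also made through a function
direct.matrix <- function(size) {
  matrix(data = NA, nrow = size, ncol = size)
}

size <- 200  # deliberately small so this runs quickly
identical(direct.matrix(size), create.matrix(size))

system.time(for (k in 1:50) direct.matrix(size))
system.time(for (k in 1:50) create.matrix(size))
```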

Regarding the difference between NA and NULL:

There are actually four special constants. As the R language definition puts it:

"In addition, there are four special constants, NULL, NA, Inf, and NaN.

NULL is used to indicate the empty object. NA is used for absent (“Not Available”) data values. Inf denotes infinity and NaN is not-a-number in the IEEE floating point calculus (results of the operations respectively 1/0 and 0/0, for instance)."

You can read more in the R manual on language definition.
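A few lines at the console show how these constants behave in practice. This is only a sketch; the vectors `v` and `w` are made up for the example.

```r
length(NA)     # 1: NA is a value that marks a missing entry
length(NULL)   # 0: NULL is the empty object

v <- c(1, NA, 3)
length(v)      # 3: the NA occupies a slot
is.na(v)       # FALSE  TRUE FALSE

w <- c(1, NULL, 3)
length(w)      # 2: NULL contributes nothing to the vector

1 / 0          # Inf
0 / 0          # NaN
is.na(0 / 0)   # TRUE: is.na() also reports NaN
is.nan(NA)     # FALSE: NA is not NaN
```

For a preallocated atomic matrix, NA is therefore the natural placeholder, since NULL cannot occupy a cell of an atomic vector.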

Shane
  • If you're going to be comparing the methods shouldn't you have wrapped both methods in a function? There is overhead added on with the function call. – Dason Feb 04 '12 at 17:11
  • @Dason In practice, if one were to use the longer method frequently, wouldn't it probably be wrapped in a function? Whereas `matrix` would remain as is. – Gregor Thomas Aug 31 '12 at 18:04
4

According to this article we can do better than preallocating with NA by preallocating with NA_real_. From the article:

as soon as you assign a numeric value to any of the cells in 'x', the matrix will first have to be coerced to numeric when a new value is assigned. The originally allocated logical matrix was allocated in vain and just adds an unnecessary memory footprint and extra work for the garbage collector. Instead allocate it using NA_real_ (or NA_integer_ for integers)
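The coercion the article describes can be seen directly with `typeof()`. A small sketch, with variable names chosen here for illustration:

```r
x_lgl <- matrix(NA, nrow = 2, ncol = 2)
x_dbl <- matrix(NA_real_, nrow = 2, ncol = 2)
x_int <- matrix(NA_integer_, nrow = 2, ncol = 2)

typeof(x_lgl)  # "logical"
typeof(x_dbl)  # "double"
typeof(x_int)  # "integer"

# Assigning a double into the logical matrix converts the whole matrix
x_lgl[1, 1] <- 1.2
typeof(x_lgl)  # "double"

# The double matrix already has the right type, so no conversion is needed
x_dbl[1, 1] <- 1.2
typeof(x_dbl)  # "double"
```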

As recommended: let's test it.

testfloat = function(mat){
  n=nrow(mat)
  for(i in 1:n){
    mat[i,] = 1.2
  }
}

> system.time(testfloat(matrix(data=NA,nrow=1e4,ncol=1e4)))
   user  system elapsed 
   3.08    0.24    3.32 
> system.time(testfloat(matrix(data=NA_real_,nrow=1e4,ncol=1e4)))
   user  system elapsed 
   2.91    0.23    3.14 

And for integers:

testint = function(mat){
  n=nrow(mat)
  for(i in 1:n){
    mat[i,] = 3
  }
}

> system.time(testint(matrix(data=NA,nrow=1e4,ncol=1e4)))
   user  system elapsed 
   2.96    0.29    3.31 
> system.time(testint(matrix(data=NA_integer_,nrow=1e4,ncol=1e4)))
   user  system elapsed 
   2.92    0.35    3.28 

The difference is small in my test cases, but it's there.

David Marx
  • 1
    `3` is of class `numeric`; you probably meant `3L`. Nevertheless, I don't see any measurable performance impact of using just NA (even if measuring for 20 s each and making sure gc does its job) – jan-glx Aug 16 '15 at 10:22
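The point about `3` versus `3L` can be checked with a short sketch. The matrix `m` is made up for the example.

```r
typeof(3)    # "double": a bare numeric literal is a double
typeof(3L)   # "integer": the L suffix makes an integer literal

m <- matrix(NA_integer_, nrow = 2, ncol = 2)

m[1, ] <- 3L
typeof(m)    # "integer": the matrix keeps its type

m[2, ] <- 3
typeof(m)    # "double": assigning a double converts the whole matrix
```

So a test that assigns `3` into a matrix preallocated with NA_integer_ still triggers a conversion to double.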
0
rows <- 3
cols <- 3
# Build a vector of NAs, then reshape it into a rows-by-cols matrix
x <- rep(NA, rows * cols)
x1 <- matrix(x, nrow = rows, ncol = cols)