20

I am running 64-bit R 3.1 in a 64-bit Ubuntu environment with 400GB of RAM, and I am encountering a strange limitation when dealing with large matrices.

I have a numeric matrix called A, that is 4000 rows by 950,000 columns. When I try to access any element in it, I receive the following error:

Error: long vectors not supported yet: subset.c:733

Although my matrix was read in via scan, you can replicate the error with the following code:

test <- matrix(1,4000,900000) #no error
test[1,1] #error

My Googling reveals this was a common error message prior to R 3.0, where a vector of size 2^31-1 was the limit. However, that limit should not apply here, given that I am running 64-bit R 3.1.

Should I not be using the native matrix type for this kind of matrix?

The_Anomaly
  • 2
    ["There is some support for matrices and arrays with each dimension less than 2^31 but total number of elements more than that."](http://cran.r-project.org/src/base/NEWS) Note the word "some" and the word "yet" in the error message. – Roland Jun 20 '14 at 21:23
  • 3
    Type `news()` at your console prompt and search for "LONG VECTORS" .... and begin reading. – IRTFM Jun 20 '14 at 21:36
  • That's an interesting error. Curious that `test[1]` works, as well as `test[,1][1]`. Even `test[1:2,1:2]` works, but not the original `test[1,1]`. – Andrey Shabalin Jun 20 '14 at 21:46
  • 1
    Take a look at the [`ff`](http://cran.r-project.org/web/packages/ff/index.html) and [`bigmemory`](http://cran.r-project.org/web/packages/bigmemory/index.html) packages. – Barranka Jun 20 '14 at 21:48
  • @AndreyShabalin Looking at the [line](https://github.com/wch/r-source/blob/trunk/src/main/subset.c#L733) in question, it appears that that case is using `LENGTH(x)`, whereas the block just above it is using `XLENGTH(x)`. As mentioned....it's a work in progress. – joran Jun 20 '14 at 21:56
  • @AndreyShabalin ...and [here](https://github.com/wch/r-source/blob/trunk/src/include/Rinternals.h#L323) is the section in the headers that sets out the difference between `LENGTH` and `XLENGTH`. – joran Jun 20 '14 at 22:04
  • @joran I was trying to make sense of that too. Notice too that the index scalars are instantiated as `R_len_t` (standard vectors) and `R_xlen_t` (long support). – Simon O'Hanlon Jun 20 '14 at 22:04
  • @joran and from line 62 above it for the values they may take. – Simon O'Hanlon Jun 20 '14 at 22:05
  • @joran, I understand that it is a work in progress. My point actually was that the large matrix is still pretty functional (except for the issue in the question). – Andrey Shabalin Jun 20 '14 at 22:24
  • This error does not occur in R 3.4.3 on Linux. – Ista Feb 15 '18 at 16:30

4 Answers

20

A matrix is just an atomic vector with a dimension attribute, which allows R to access it as a matrix. Your matrix is a vector of length 4000*900000, which is 3.6e+9 elements (the largest integer value is approx. 2.147e+9). Subsetting a long vector is supported for atomic vectors (i.e. accessing elements beyond the 2.147e+9 limit). Just treat your matrix as a long vector.
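To see what "a vector with a dimension attribute" means in practice, here is a minimal sketch on a toy vector (the name `v` is only for illustration). It also checks that the matrix from the question really does exceed the integer limit.

```r
# A plain vector becomes a matrix once it is given a dim attribute.
v <- 1:6
dim(v) <- c(2, 3)

is.matrix(v)              # TRUE
length(v)                 # 6: the underlying vector is unchanged

# The same element can be reached by linear or by [row, col] indexing.
identical(v[5], v[1, 3])  # TRUE

# The matrix in the question holds more elements than an integer can count.
4000 * 900000 > .Machine$integer.max  # TRUE
```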

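Remember that by default R fills matrices column-wise, so the element at [row, col] of a matrix with nrow rows sits at linear index (col - 1) * nrow + row. If we wanted to retrieve, say, the value at test[ 2701 , 850000 ] we could access it via:

i <- ( 850000 - 1 ) * 4000 + 2701 # (col - 1) * nrow + row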
test[i]
#[1] 1
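As a sanity check of column-major indexing, the rule can be tried on a small matrix first. This is only a sketch: `linear_index` is a hypothetical helper written for this example, not a base R function.

```r
# Hypothetical helper: convert a (row, col) pair into the linear index
# of a column-major matrix with `nrow` rows.
linear_index <- function(row, col, nrow) {
  # as.numeric() keeps the arithmetic in doubles, so the result may
  # exceed .Machine$integer.max without overflowing.
  (as.numeric(col) - 1) * nrow + row
}

z <- matrix(1:9, 3, 3)
z[2, 3]                          # 8
z[linear_index(2, 3, nrow(z))]   # 8

# Index of test[2701, 850000] in the 4000-row matrix from the question.
linear_index(2701, 850000, 4000) == 3399998701  # TRUE
```

Because the helper works in doubles, the same call can be used on a matrix with more than 2^31 - 1 elements.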

Note that this really is long vector subsetting because:

2701L * 850000L
#[1] NA
#Warning message:
#In 2701L * 850000L : NAs produced by integer overflow
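The difference between the two kinds of arithmetic can be checked directly. A short sketch, with illustrative variable names:

```r
.Machine$integer.max                 # 2147483647, i.e. 2^31 - 1

class(2701L)                         # "integer"
class(2701)                          # "numeric" (stored as a double)

# Double arithmetic: the product is representable, so no overflow.
prod_double <- 2701 * 850000
prod_double > .Machine$integer.max   # TRUE

# Integer arithmetic: the product does not fit in 32 bits, so R returns NA
# (the warning is suppressed here only to keep the output quiet).
prod_int <- suppressWarnings(2701L * 850000L)
is.na(prod_int)                      # TRUE
```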
Simon O'Hanlon
  • Thanks for the great answer. Could you please help me understand your second statement? 2701L * 850000L, why would that produce NA when 2701*850000 does not? I would have thought that by specifying L, it would store it as a long integer and make it capable of handling such a large number. – The_Anomaly Jun 21 '14 at 19:02
  • Because `L` explicitly specifies an integer type. `class(2701)` is `"numeric"` (similarly for 850000). (I think) R doesn't have native long integers available for the end-user (see `?integer`). (Don't know/remember why `L` is the integer code; maybe look in the R language manual?) – Ben Bolker Jun 21 '14 at 21:41
  • 1
    Thank you for your comment @BenBolker. For anyone else's benefit who may be reading, numeric and double are the same. So when talking about "long vector" that really just means "a vector that is long," not a vector that is indexed by a long integer, because a long integer does not exist in R. So, when Simon wrote 2701L*850000L results in NA it is because we are forcing to use the Integer type which has the limit of 2.147e9. Without the L, we are using numeric (which is double and has a much larger range). So the L has nothing to do with the long int of C :) – The_Anomaly Jun 23 '14 at 14:57
  • 1
    @The_Anomaly the `long int` type of C *was* traditionally 32bit when it was introduced (and an `int` type was 16 bits). R has been around for a while so I disagree with you and theorise (and Prof. Ripley agrees) that it *is* shorthand for `long int`. In fact I wrote a [question and answer](http://stackoverflow.com/questions/24350733/why-would-r-use-the-l-suffix-to-denote-an-integer/24350749#24350749) about this! – Simon O'Hanlon Jun 23 '14 at 15:00
  • That would make a great deal of sense--thanks for the history, Simon. – The_Anomaly Jun 23 '14 at 15:03
  • Nope, this solution is wrong. For example, when `z = matrix(1:9, 3, 3)` and `z[2, 3] # 8`, this happens `z[ (2 - 1) * 3 + 2 ] # 5`. This would be correct: `z[ (2) * 3 + 2 ] # 8` – Stingery Jun 15 '16 at 00:06
3

An alternative, quick solution would be to first extract the row and then pick the column (now the i'th element of the resulting vector). For example ...

test <- matrix(1,4000,900000) #no error 
test[1,1] #error
test[1, ][1] # no error

Of course, this produces some overhead, as the whole row is copied/accessed first, but it's more straightforward to read. Also works for first extracting the column and then the row.
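A minimal sketch of the same idea wrapped in a function, checked on a small matrix. `get_elem` is a hypothetical name, and extracting the column first is an assumption about what is cheaper here: since R stores matrices column-wise, `m[, j]` copies only nrow elements (4000 in the question) rather than a full row of 900000.

```r
# Hypothetical helper: two-step access that avoids m[i, j] on a long matrix.
get_elem <- function(m, i, j) {
  m[, j][i]  # extract column j, then take its i-th element
}

z <- matrix(1:12, 3, 4)
identical(get_elem(z, 2, 3), z[2, 3])  # TRUE
identical(z[2, ][3], z[2, 3])          # TRUE: row-first gives the same value
```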

Stingery
0

TL;DR - try to remove the cache=TRUE argument from the curly braces of the chunk header.

I had this error for a data frame with 1,720,238 observations and 302 variables, which is lower than the threshold @Simon has mentioned (1,720,238*302 = 5.2e+8 < 2.147e+9).

@subhash's answer was the hint that led me to try removing the cache argument entirely, which fixed the error for me.

arielhasidim
-3

library(knitr)

knitr::opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE, cache.lazy = FALSE)

  • 1
    what? what kind of answer is this? – raygozag Apr 08 '21 at 21:36
  • 1
    This actually helped me, because my long vector problem stemmed from `Rmd` caching mechanism, see https://bookdown.org/yihui/rmarkdown-cookbook/cache-lazy.html But the answer is so out of context, I agree! – Jan Netík Jun 16 '22 at 10:29
  • For more info, the logic behind this answer is probably from here: https://stackoverflow.com/a/41004082/7224607 – arielhasidim Jun 21 '23 at 09:12