
Using top, I manually measured the following memory usages at the specific points designated in the comments of the following code block:

x <- matrix(rnorm(1e9), nrow = 1e4)
# ~15gb
gc()
# ~7gb after gc()
y <- as.vector(x)
gc()
# ~15gb after gc()

It's pretty clear that rnorm(1e9) is a ~7gb vector that is then copied to create the matrix. gc() removes the original vector since it's not assigned to anything. as.vector(x) then coerces the matrix and copies its data into a new vector.
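The ~7gb figure matches what you would expect for doubles at 8 bytes each: 1e9 elements is 8e9 bytes, or roughly 7.45 GiB. As a rough sanity check, here is a scaled-down sketch (1e6 elements rather than 1e9, an arbitrary size picked so it runs quickly) using object.size():

```r
# Scaled-down check: 1e6 doubles instead of 1e9 (size chosen only to run fast)
n <- 1e6
v <- rnorm(n)

bytes_small <- as.numeric(object.size(v))  # payload plus a small object header
bytes_per_elem <- bytes_small / n          # should be very close to 8

# Estimated payload for a vector of 1e9 doubles, in GiB
est_gib <- 1e9 * 8 / 1024^3

print(bytes_per_elem)
print(est_gib)
```

Two live copies of that payload would account for the ~15gb readings.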

My question is, why can't these three objects all point to the same memory block (at least until one is modified)? Isn't a matrix really just a vector with some additional metadata?

This is in R version 3.6.2

edit: also tested in 4.0.3, same results.

Anna
Michael
  • https://stackoverflow.com/a/2603318/13513328 – Waldi Feb 05 '21 at 10:48
  • @waldi that answer is...incomplete. it's true that objects are immutable from the user's perspective, but R uses a "copy on modify" optimization to avoid copying the value in memory until necessary. Ie, objects can be pointers to the same value until the values diverge, and only then is a copy made: See https://stackoverflow.com/questions/15759117/what-exactly-is-copy-on-modify-semantics-in-r-and-where-is-the-canonical-source – Michael Feb 05 '21 at 16:32
  • So my question is: why does R consider coercion between vector and matrix a "modification", when you could envision a matrix being stored as a pointer to some metadata (nrow, ncol, and a pointer to a vector)? I'm asking: what are the specifics of matrix storage in R that preclude this optimization from occurring here? – Michael Feb 05 '21 at 16:37
  • thanks for this clarification – Waldi Feb 05 '21 at 17:23
  • relevant: torch in R has zero-copy reshaping (https://torch.mlverse.org/technical/tensors/) – Michael Feb 07 '21 at 04:38

2 Answers


Your question is really about the reasoning behind the design. That seems better suited to the R-devel mailing list, and I assume the answer there would be "no one knows". The relevant function in the R source is do_asvector.

Going down the source code of a call to as.vector(matrix(...)), it is important to note that the default argument for mode is any. This translates to ANYSXP (see R internals). This lets us find the evil culprit (line 1524) of the copy-behaviour.
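For reference, the default mode can be checked from the R side without reading the C source. A minimal sketch (the variable names are mine, just for illustration):

```r
# Inspect the default value of `mode` in as.vector()
default_mode <- formals(as.vector)$mode
print(default_mode)

# With the default mode, a double matrix comes back as a double vector
# with its attributes (here: dim) stripped
m <- matrix(rnorm(6), nrow = 2)
v <- as.vector(m)
print(typeof(v))
print(attributes(v))
```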

// source reference: do_asvector
...
    if(type == ANYSXP || TYPEOF(x) == type) {
    switch(TYPEOF(x)) {
    case LGLSXP:
    case INTSXP:
    case REALSXP:
    case CPLXSXP:
    case STRSXP:
    case RAWSXP:
        if(ATTRIB(x) == R_NilValue) return x;
        ans  = MAYBE_REFERENCED(x) ? duplicate(x) : x; // <== evil culprit
        CLEAR_ATTRIB(ans);
        return ans;
    case EXPRSXP:
    case VECSXP:
        return x;
    default:
        ;
    }
...

Going one step further, we can find the definition for MAYBE_REFERENCED in src/include/Rinternals.h, and by digging a bit we can find that it checks whether sxpinfo.named is equal to 0 (false) or not (true). What I am guessing here is that the assignment operator <- increments the sxpinfo.named counter and thus MAYBE_REFERENCED(x) returns TRUE and we get a duplicate (deep copy).

However, is this behaviour necessary?

That is a great question. If we had passed a mode other than "any" or one matching the type of our input, we would skip the duplicate line and continue down the function until we hit ascommon. Digging a bit further into the source code for ascommon, we can see that if we convert to a list manually (setting mode = "list"), ascommon only calls shallow_duplicate.

// Source reference: ascommon
---
    if ((type == LISTSXP) &&
        !(TYPEOF(u) == LANGSXP || TYPEOF(u) == LISTSXP ||
          TYPEOF(u) == EXPRSXP || TYPEOF(u) == VECSXP)) {
        if (MAYBE_REFERENCED(v)) v = shallow_duplicate(v); // <=== ascommon duplication behaviour
        CLEAR_ATTRIB(v);
    }
    return v;
    }
---

So one could imagine that the call to duplicate in do_asvector could be replaced by a call to shallow_duplicate. Perhaps a "better safe than sorry" strategy was chosen when the code was originally implemented (prior to R-2.13.0 according to a comment in the source code), or perhaps there is a scenario in one of the types not handled by ascommon that requires a deep-copy.

For now I would test if the function does a deep-copy if we set mode='list' or pass the list without assignment. In either case it might not be a bad idea to send a follow-up question to the R-devel mailing list.

Edit: <- behaviour

I took the liberty of confirming my suspicion by looking at the source code for <-. I previously assumed that <- increments sxpinfo.named, and we can confirm this by looking at do_set (the C source code for <-). When assigning as x <- ..., x is a SYMSXP, and thus the source code calls INCREMENT_NAMED, which in turn calls SET_NAMED(x, NAMED(x) + 1). So, everything else being equal, we should see copy behaviour for x <- matrix(...); y <- as.vector(x), while we shouldn't for y <- as.vector(matrix(...)).
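One way to observe the duplication from R itself is tracemem(), which reports when a traced object is duplicated. This is only a sketch: tracemem() works only if R was built with memory profiling (checked below via capabilities("profmem")), the matrix size is arbitrary, and I have not verified what it reports on every R version.

```r
# Small matrix bound to a name, so the value is referenced through x
x <- matrix(rnorm(1e4), nrow = 100)

# tracemem() is only usable when R was built with memory profiling
can_trace <- capabilities("profmem")
if (can_trace) invisible(tracemem(x))

# If do_asvector duplicates x, a "tracemem[... -> ...]" line is printed here
y <- as.vector(x)

if (can_trace) untracemem(x)
```

A printed tracemem line at the as.vector() call would indicate that a duplicate was made.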

Oliver
  • Doesn't `shallow_duplicate` do the same thing as `duplicate` on atomic vectors? – user2554330 Feb 08 '21 at 09:23
  • Honestly I didn't know the answer to your question. But once again with some digging, `shallow_duplicate` indeed calls `DUPLICATE_ATOMIC_VECTOR` (see: src/main/duplicate.c), which in turn performs a full memory copy but a shallow copy of attributes (interesting choice). I would imagine that the time spent on creating a contiguous memory block would outweigh other costs. So I am not sure I agree with the efficiency reasons regardless. But you are right, in the specific case of giving a matrix (atomic vector) as input, using shallow_duplicate would not make a difference. It might elsewhere. :-) – Oliver Feb 08 '21 at 11:55

At the final gc(), you have x pointing to a vector with a dim attribute, and y pointing to a vector without any dim attribute. The data is an intrinsic part of the object, it's not an attribute, so those two vectors have to be different.
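A small illustration of that difference (a tiny integer matrix chosen only for readability; z is a hypothetical extra variable):

```r
x <- matrix(1:6, nrow = 2)
y <- as.vector(x)

print(attributes(x))  # contains dim
print(attributes(y))  # no attributes at all

# Attaching a dim attribute to the plain vector gives back an equivalent matrix
z <- y
dim(z) <- c(2L, 3L)
print(identical(x, z))
```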

If matrices had been implemented as lists, e.g.

 x <- list(data = rnorm(1e9), dim = c(1e4, 1e5))

then a shallow copy would be possible, but that's not how it was done. You can read the details of the internal structure of objects in the R Internals manual. For the current release, that's here: https://cloud.r-project.org/doc/manuals/r-release/R-ints.html#SEXPs .

You may wonder why things were implemented this way. I suspect it's intended to be efficient for the common use cases. Converting a matrix to a vector isn't generally necessary (you can treat x as a vector already, e.g. x[100000] and y[100000] will give the same value), so there's no need for "convert to vector" to be efficient. On the other hand, extracting elements is very common, so you don't want to have an extra pointer dereference slowing that down.
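To show what treating x as a vector looks like in practice, here is a small sketch (sizes and variable names are arbitrary). Linear indexing on the matrix follows column-major order, so no conversion is needed to get at the same elements:

```r
x <- matrix(rnorm(12), nrow = 3)
y <- as.vector(x)

# The same linear index gives the same element in both objects
same_elem <- identical(x[5], y[5])

# Column-major layout: (row 2, col 2) in a 3-row matrix is element 2 + (2 - 1) * 3 = 5
col_major <- identical(x[2, 2], y[5])

print(same_elem)
print(col_major)
```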

user2554330